(5)  Introduction 


Treatment of breast cancer at an early stage is the most significant means of improving
the  survival  rate  of  the  patients.  Mammography  is  currently  the  most  sensitive  method  for  detecting 
early  breast  cancer,  and  it  is  also  the  most  practical  for  screening.  However,  the  positive  predictive 
value  of  mammographic  diagnosis  is  only  about  15%-30%.  As  the  number  of  patients  who  undergo 
mammography  increases,  it  will  be  increasingly  important  to  improve  the  positive  predictive  value 
of  mammography  in  order  to  reduce  costs  and  patient  discomfort.  In  this  proposal,  our  goal  is  to 
investigate  the  problem  of  classifying  mammographic  lesions  as  malignant  or  benign  using  computer 
vision,  automatic  feature  extraction,  statistical  classification,  and  artificial  intelligence  techniques. 
Our  efforts  are  concentrated  on  the  computer-aided  classification  of  two  kinds  of  breast 
abnormalities,  masses  and  microcalcifications,  which  are  the  primary  mammographic  signs  of 
malignancy.  We  are  investigating  computerized  extraction  of  useful  features  for  the  differentiation 
of  malignant  and  benign  cases  for  both  abnormalities,  and  the  application  of  classical  statistical 
classifiers  and  newly  developed  paradigms  such  as  neural  networks  and  genetic  algorithms  for  the 
classification  task.  Our  purposes  are  to  i)  improve  existing  techniques,  devise  new  methods,  and 
identify  the  preferred  approaches  for  the  classification  of  mammographic  lesions,  ii)  show  that 
computerized  classification  of  mammographic  lesions  is  feasible,  and  iii)  develop  a  computerized 
program  that  can  subsequently  be  shown  to  improve  radiologists’  classification  of  mammographic 
abnormalities. 

(6)  Body

In  the  fourth  year  (5/1/99-4/30/00)  of  this  grant,  we  have  performed  the  following  studies: 

(A)  Development  of  spiculation  features  for  classification  of  masses 

Spiculations  are  important  indicators  of  malignancy  on  mammograms.  In  the  third  year  of 
the  project,  we  had  reported  on  the  development  of  a  spiculation  detection  method.  After  the 
spiculations  were  detected  and  segmented,  the  shape  of  the  segmented  mass  was  modified  by 
appending  the  segmented  spiculations  to  the  core  of  the  mass.  Subsequently,  morphological 
features  were  extracted  from  the  segmented  mass  shape. 

In  the  fourth  year  of  the  project,  we  extracted  features  directly  related  to  the  degree  of 
spiculation of the mass. The extraction of these features is described next. Let (ic,jc) be a pixel on
the mass contour (Fig. 1). We first define a search region S as shown in Fig. 2. For each pixel
(i,j) in S, we compute the angular difference θ between the image gradient direction at image pixel
(i,j) and the direction of the vector joining pixels (ic,jc) and (i,j) (Fig. 1). If the pixel (ic,jc) lies on
the path of a spiculation, then θ will be close to π/2 whenever the image pixel (i,j) is on the
spiculation. Therefore, the distribution of θ, obtained from all image pixels (i,j) within the search
region S, will have a peak around π/2. If there is no spiculation, and if the gray levels in S are
randomly distributed, then this distribution will be uniform. This was the basic idea behind the
spiculation detection method reported last year. In the fourth year of the project, we computed the
average of θ within S for all pixels on the mass boundary. In addition, the mass boundary was
enlarged one pixel at a time, and this computation was repeated in a 30-pixel wide (3cm) ring
around the segmented mass. A new image, called the spiculation likelihood map, was generated,
in which the gray-level value for pixel (ic,jc) was the average of θ within the window S for pixel
(ic,jc).

[Fig. 1 (caption fragment): ... the angular difference θ.]
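
The angular difference between the gradient direction and the joining vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the function name `angular_difference`, the use of `numpy.gradient` as the gradient estimate, the folding of θ into the range [0, π/2], and the representation of the search region S as a caller-supplied list of pixels are all hypothetical choices (the actual shape of S is defined in Fig. 2).

```python
import numpy as np

def angular_difference(image, contour_pixel, region_pixels):
    """Return theta for each pixel (i, j) of the search region S.

    theta is the angle between the gradient direction at (i, j) and the
    vector joining the contour pixel (ic, jc) to (i, j), folded into
    [0, pi/2] (assumption: the two directions are compared as lines).
    """
    gy, gx = np.gradient(np.asarray(image, dtype=float))  # row and column derivatives
    ic, jc = contour_pixel
    thetas = []
    for i, j in region_pixels:
        grad_dir = np.arctan2(gy[i, j], gx[i, j])
        join_dir = np.arctan2(i - ic, j - jc)
        d = abs(grad_dir - join_dir) % np.pi
        thetas.append(min(d, np.pi - d))
    return np.array(thetas)

# Toy image whose gray level changes only along the columns, so the
# gradient is horizontal everywhere.
toy = np.tile(np.arange(11.0), (11, 1))
theta = angular_difference(toy, (5, 5), [(2, 5), (5, 8)])
mean_theta = theta.mean()  # value that would be written into the likelihood map
```

In the toy image, the pixel directly above the contour pixel yields θ = π/2 (gradient perpendicular to the joining vector), which is the situation expected along a spiculation.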

The spiculation likelihood map was thresholded at a constant threshold. The pixels above
the threshold were those for which θ was high, i.e., those for which the likelihood of spiculation
was  high.  Three  spiculation  measures  were  extracted  from  the  thresholded  spiculation  likelihood 
map.  These  were,  i)  the  number  of  objects  in  the  thresholded  image  (NPS),  which  was  related  to 
the  number  of  possible  spiculations,  ii)  the  percentage  area  of  the  objects  (PAS)  in  the  thresholded 
image  relative  to  the  area  of  the  30-pixel-wide  ring,  which  was  related  to  the  percentage  area  of 
spiculations,  and  iii)  the  product  of  these  two  measures  (PR). 
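
The three measures can be computed from a thresholded map as in the following sketch. The function names, the 8-connectivity used to define an "object", and the boolean `ring_mask` argument are assumptions made for illustration; the report does not state which connectivity or threshold value was used.

```python
import numpy as np

def count_objects(mask):
    """Count 8-connected objects in a boolean mask (stack-based flood fill)."""
    visited = np.zeros(mask.shape, dtype=bool)
    n_rows, n_cols = mask.shape
    count = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r, c] and not visited[r, c]:
                count += 1
                visited[r, c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < n_rows and 0 <= nx < n_cols
                                    and mask[ny, nx] and not visited[ny, nx]):
                                visited[ny, nx] = True
                                stack.append((ny, nx))
    return count

def spiculation_measures(likelihood_map, ring_mask, threshold):
    """Return (NPS, PAS, PR) for the part of the map inside the ring."""
    objects = (likelihood_map > threshold) & ring_mask
    nps = count_objects(objects)                      # number of objects
    pas = 100.0 * objects.sum() / ring_mask.sum()     # percentage area of objects
    return nps, pas, nps * pas                        # PR is the product

# Toy example: two objects covering 3 of 36 ring pixels.
toy_map = np.zeros((6, 6))
toy_map[0, 0] = toy_map[0, 1] = toy_map[4, 4] = 1.0
nps, pas, pr = spiculation_measures(toy_map, np.ones((6, 6), dtype=bool), 0.5)
```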

(B)  Classification  of  masses  using  morphological  and  spiculation  features 

These  three  spiculation  measures  were  used  in  addition  to  eleven  morphological  features 
extracted  from  the  mass  outline  for  mass  characterization.  The  morphological  features  were  those 
that  were  found  to  be  useful  for  mass  characterization  in  the  previous  years  of  our  project.  The 
first  five  morphological  features  were  based  on  the  normalized  radial  length  (NRL),  defined  as  the 
Euclidean  distance  from  the  object’s  centroid  to  each  of  its  edge  pixels  and  normalized  relative  to 
the  maximum  radial  length  for  the  object.  These  features  included  NRL  mean,  standard  deviation, 
entropy,  area  ratio,  and  zero  crossing  count.  The  remaining  six  morphological  features  included  the 
perimeter, area, perimeter-to-area ratio, circularity, rectangularity, and contrast of the object.
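
As an illustration of the NRL definition, the sketch below computes the NRL mean and standard deviation from a list of edge pixels. The function name is hypothetical, and using the centroid of the edge pixels when no object centroid is supplied is an assumption. The entropy, area ratio, and zero crossing count are omitted because their definitions are not given in this report.

```python
import numpy as np

def nrl_mean_std(edge_pixels, centroid=None):
    """Mean and standard deviation of the normalized radial length (NRL)."""
    edge = np.asarray(edge_pixels, dtype=float)
    if centroid is None:
        # Assumption: the centroid of the edge pixels stands in for the object centroid.
        centroid = edge.mean(axis=0)
    radial = np.linalg.norm(edge - np.asarray(centroid, dtype=float), axis=1)
    nrl = radial / radial.max()   # normalize by the maximum radial length
    return nrl.mean(), nrl.std()

# Four edge pixels at equal distance from the centroid: NRL is 1 everywhere.
nrl_mean, nrl_std = nrl_mean_std([(0, 2), (2, 0), (0, -2), (-2, 0)])
```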

The  training  and  test  sets  used  in  the  evaluation  of  the  classifier  were  completely 
independent.  Our  training  data  set  consisted  of  243  mammograms  (116  benign  and  127 
malignant)  from  101  patients.  Our  test  data  set  consisted  of  95  mammograms  (42  benign  and  53 
malignant)  from  45  patients.  A  single  view  was  available  for  nine  of  these  45  patients.  For  the 
remaining  36  test  patients,  two  or  more  views  were  available.  The  true  pathology  of  all  the 
masses  was  determined  by  biopsy  and  histologic  analysis. 

Stepwise  feature  selection  was  used  to  select  effective  features  for  classification  from  the 
feature  space  of  fourteen  features.  Four  features,  namely,  NPS,  PR,  contrast,  and  circularity  were 
selected  using  the  set  of  training  regions  of  interest  (ROIs).  A  backpropagation  neural  network 
(BPN)  with  four  input  nodes,  two  hidden-layer  nodes,  and  a  single  output  node  was  trained  using 
the  training  set.  The  accuracy  of  the  designed  classifier  was  evaluated  by  applying  the  classifier  to 
test  cases  that  had  not  been  used  for  training.  The  test  scores  were  analyzed  using  receiver 
operating  characteristic  (ROC)  methodology.  The  classification  accuracy  was  evaluated  as  the  area 
Az  under  the  ROC  curve. 

We  investigated  film-based  classification  of  the  masses  on  each  mammogram,  as  well  as 
case-based classification by combining possible multiple views of the same mass. For case-based
classification, the BPN scores from different views were averaged. The training Az values
for  film-based  and  case-based  classification  were  0.91  and  0.95  respectively.  The  test  Az  values 
for  film-based  and  case-based  classification  were  0.81  and  0.87.  The  training  and  test  ROC 
curves  are  shown  in  Figs  3(a)  and  3(b),  respectively. 
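
The case-based scoring step can be sketched as follows. The function names and the toy scores are hypothetical. Note also that the area computed here is the empirical (Mann-Whitney) area under the ROC curve, which is swapped in for simplicity; it is not the ROC methodology used to obtain the Az values reported above.

```python
from collections import defaultdict

def case_scores(film_scores, case_ids):
    """Average the classifier scores of all views that belong to the same case."""
    grouped = defaultdict(list)
    for score, case in zip(film_scores, case_ids):
        grouped[case].append(score)
    return {case: sum(v) / len(v) for case, v in grouped.items()}

def empirical_auc(malignant_scores, benign_scores):
    """Fraction of malignant/benign pairs ranked correctly (ties count one half)."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            wins += 1.0 if m > b else 0.5 if m == b else 0.0
    return wins / (len(malignant_scores) * len(benign_scores))

# Toy data: case "a" has two views, case "b" has one.
per_case = case_scores([0.2, 0.4, 0.9], ["a", "a", "b"])
toy_auc = empirical_auc([0.9, 0.8], [0.3, 0.85])
```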

Figure 3  ROC curves for film-based and case-based classification: (a) training, (b) test.


The differences between this classification method and the independent classification reported
last year are the following: 1) The current classification method relies only on morphological and
spiculation  features,  whereas  the  previous  classifier  was  based  on  texture  and  morphological 
features;  and  2)  In  the  current  algorithm,  we  merged  information  from  multiple  views  of  a  mass  to 
improve  the  classification  accuracy.  A  logical  next  step  is  to  combine  spiculation,  morphological, 
and  texture  features  for  classification.  However,  our  initial  attempts  to  perform  this  by  using  all 
features  in  LDA  with  stepwise  feature  selection  were  fruitless.  It  seems  that  the  spiculation 
features  are  very  dominant  in  classification,  and  once  they  are  selected,  the  inclusion  of  texture 
features actually decreases the classification accuracy. This is a strong indication that a more
sophisticated classifier is required. We are currently evaluating a hierarchical classifier that will
use  spiculation  features  for  an  initial  classification,  followed  by  LDA  that  uses  texture  features. 
We  are  also  continuing  to  increase  our  mass  database  so  that  both  classifier  design  and  testing  can 
be  performed  with  larger  data  sets. 

(C)  Feature  extraction  from  computer-extracted  microcalcifications 


In  the  third  year  of  the  project,  we  had  reported  on  the  development  of  feature  extraction 
methods  from  manually  identified  microcalcifications.  In  the  fourth  year  of  the  project,  we 
investigated  feature  extraction  from  computer-detected  microcalcifications.  The  ROI  to  search  for 
the  microcalcifications  was  still  manually  identified.  After  the  ROI  was  chosen,  the 
microcalcifications  were  automatically  detected  in  the  ROI  containing  the  cluster.  Some  of  the 
detections  were  inevitably  false-positives,  i.e.,  non-calcified  points  that  were  brighter  than  their 
neighboring  pixels.  In  addition,  we  also  had  false-negatives,  for  example,  some  subtle 
microcalcifications  were  not  detected  by  the  detection  algorithm.  Since  it  will  not  be  possible  to 
manually  identify  the  microcalcifications  in  practice,  the  presence  of  these  false-negatives  and 
false-positives  represented  a  more  realistic  test  condition  for  our  characterization  algorithms. 
Since our purpose in this project is lesion characterization, we did not attempt to find the
false-positive and false-negative detection rates in these ROIs.

(D)  Computer classification for automatically-detected microcalcifications

Starting  at  the  detected  locations,  the  shapes  of  microcalcifications  were  extracted  using  a 
region  growing  algorithm.  Five  morphological  features,  namely,  size,  mean  density,  ratio  of 
second  moments,  eccentricity  of  an  effective  ellipse,  and  ratio  of  major  and  minor  axes  of  the 
effective  ellipse,  were  extracted  from  each  segmented  microcalcification.  Since  the  variations  of 
the  shapes  and  sizes  of  the  individual  microcalcifications  within  a  cluster  are  important  for 
microcalcification  classification,  the  maximum,  mean,  standard  deviation,  and  coefficient  of 
variation  of  these  individual  features  were  computed  for  each  cluster.  The  number  of 
microcalcifications  in  a  cluster  was  also  used  as  a  morphological  feature.  These  twenty-one 
morphological  features  were  the  same  features  that  were  used  in  our  last  yearly  report  for 
microcalcification  classification. 
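
The count of twenty-one follows from five individual features times four cluster statistics, plus the number of microcalcifications. A minimal sketch of this aggregation is shown below; the function name, the ordering of the output vector, and the use of the population standard deviation are assumptions.

```python
import numpy as np

def cluster_features(calc_features):
    """Build the 21 cluster-level features from per-calcification features.

    calc_features has one row per microcalcification and five columns
    (size, mean density, moment ratio, eccentricity, axis ratio).
    """
    f = np.asarray(calc_features, dtype=float)
    mean, std = f.mean(axis=0), f.std(axis=0)   # assumption: population std (ddof=0)
    cv = std / mean                             # coefficient of variation
    return np.concatenate([f.max(axis=0), mean, std, cv, [f.shape[0]]])

# Toy cluster with two microcalcifications.
v = cluster_features([[1, 2, 3, 4, 5], [3, 2, 1, 4, 5]])
```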

Four  gray  level  difference  statistics  features,  namely  mean,  entropy,  contrast,  and  angular 
second  moment  were  extracted  at  four  different  directions  from  the  ROI  containing  the 
microcalcification  cluster.  We  thus  had  16  texture  features. 
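
For readers unfamiliar with gray level difference statistics, the sketch below computes the four measures from the histogram of absolute gray-level differences at a given pixel offset. The function name, the one-pixel displacement, the four offsets in `DIRECTIONS`, and the base-2 logarithm in the entropy are assumptions; the report does not specify them.

```python
import numpy as np

# Assumed offsets (dy, dx) for 0, 45, 90, and 135 degrees at a distance of one pixel.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

def glds_features(roi, dy, dx, levels):
    """Mean, entropy, contrast, and angular second moment of |I(y,x) - I(y+dy,x+dx)|."""
    roi = np.asarray(roi, dtype=np.int64)
    rows, cols = roi.shape
    a = roi[max(0, -dy):rows - max(0, dy), max(0, -dx):cols - max(0, dx)]
    b = roi[max(0, dy):rows + min(0, dy), max(0, dx):cols + min(0, dx)]
    diff = np.abs(a - b).ravel()
    p = np.bincount(diff, minlength=levels) / diff.size   # difference histogram
    i = np.arange(p.size)
    nz = p[p > 0]
    return {
        "mean": float((i * p).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "contrast": float((i ** 2 * p).sum()),
        "asm": float((p ** 2).sum()),
    }

# Toy ROI: every horizontal neighbor pair differs by exactly one gray level.
toy_roi = [[0, 1], [0, 1]]
f0 = glds_features(toy_roi, 0, 1, levels=4)
all_features = [glds_features(toy_roi, dy, dx, levels=4) for dy, dx in DIRECTIONS]
n_texture = sum(len(f) for f in all_features)   # 4 measures x 4 directions
```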

Texture  and  morphological  features  were  combined  for  classification  using  linear 
discriminant  analysis  with  stepwise  feature  selection.  The  data  set  for  computerized 
classification  consisted  of  112  pairs  (CC  and  MLO  or  CC  and  LAT)  of  mammograms.  The 
number  of  malignant  and  benign  pairs  were  40  and  72,  respectively.  The  mammograms  were 
digitized with a Lumisys DIS-1000 laser scanner at a pixel size of 35 µm × 35 µm and a pixel
depth of 12 bits. A leave-one-case-out method was used for both feature selection and classifier
parameter estimation. The scores from the two views of a pair were averaged to obtain a score
for the pair. Computer classification scores were analyzed by ROC analysis. The accuracy of
the classifier was evaluated by the area Az and the partial area index Az(TPF0) above a
true-positive fraction of TPF0=0.90. The computer classifier had an ROC area of 0.83 and a partial
area index of 0.42.

(E)  Comparison of computer classification and malignancy assessment by radiologists for microcalcifications

We  conducted  an  ROC  study  in  which  7  MQSA-approved  radiologists  read  the  same  112 
pairs  of  ROIs.  The  ROIs  were  printed  on  film  with  a  laser  printer.  The  radiologists  rated  the 
likelihood  of  malignancy  of  each  pair  on  a  10-point  rating  scale.  The  case  order  was  randomized 
for  each  radiologist.  Radiologist  ratings  were  analyzed  with  ROC  methodology.  The  average 
ROC  curve  of  7  radiologists  was  computed  by  averaging  the  slope  and  intercept  parameters  of 
individual  ROC  curves.  The  classification  accuracy  of  the  radiologists  was  compared  to  that  of 
the  computer.  It  was  found  that  the  Az  value  of  the  computer  was  higher  than  that  of  all 
radiologists,  and  the  difference  was  statistically  significant  for  three  of  the  radiologists  (p=0.03). 
When the partial area index above TPF=0.90 was analyzed, it was found that computer
characterization  was  significantly  more  accurate  than  all  radiologists  (p<0.05).  The  ROC  curves 
and  the  comparison  of  the  partial  area  index  are  shown  below. 

Figure 4  The comparison of the computer classifier and the average of 7 radiologists


              Partial Area Index (TPF>0.9)
              Computer     Radiologist     Difference     Two-tailed p value
R1            0.42         0.10            0.32           0.0045
R2            0.42         0.05            0.37           0.0008
R3            0.42         0.13            0.29           0.0158
R4            0.42         0.09            0.33           0.0042
R5            0.42         0.06            0.36           0.0018
R6            0.42         0.14            0.28           0.0507
R7            0.42         0.07            0.35           0.0022
Ave. ROC      0.42         0.09            0.33

Table 1  The comparison of the partial area index (TPF=0.9) between the computer and seven
radiologists. The difference between the computer and all seven radiologists was
statistically significant.

(7)  Appendix

1.  Key research accomplishments in current year as a result of this grant


•  Features  related  to  the  degree  of  spiculation  were  extracted  from  mammographic  masses 

•  A  classification  algorithm  that  relies  only  on  morphological  and  spiculation  features  was 
developed 

•  The  accuracy  of  the  mass  classification  algorithm  was  tested  on  a  completely  independent 
test  set.  The  combination  of  this  algorithm  with  texture  features  still  needs  to  be  performed. 

•  Morphological  features  were  extracted  from  computer-detected  microcalcifications.  This  is  a 
step toward a more realistic implementation of the classification algorithm compared to the
previous  year,  in  which  we  had  used  hand-detected  microcalcifications. 

•  The  classification  algorithm  that  was  developed  in  year  three  was  applied  to  features 
extracted  from  computer-detected  microcalcifications. 

•  Using  an  observer  performance  study,  it  was  shown  that  the  developed  classifier  was 
significantly  more  accurate  than  experienced  radiologists  at  the  high-sensitivity  portion  of  the 
ROC  curve. 

ABSTRACT

In  computer-aided  diagnosis  (CAD),  a  frequently-used  approach  is  to  first  extract  several  potentially  useful  features 
from  a  data  set.  Effective  features  are  then  selected  from  this  feature  space,  and  a  classifier  is  designed  using  the  selected 
features.  In  this  study,  we  investigated  the  effect  of  finite  sample  size  on  classifier  accuracy  when  classifier  design  involves 
feature  selection.  The  feature  selection  and  classifier  coefficient  estimation  stages  of  classifier  design  were  implemented 
using  stepwise  feature  selection  and  Fisher’s  linear  discriminant  analysis,  respectively.  The  two  classes  used  in  our 
simulation  study  were  assumed  to  have  multidimensional  Gaussian  distributions,  with  a  large  number  of  features  available 
for  feature  selection.  We  investigated  the  effect  of  different  covariance  matrices  and  means  for  the  two  classes  on  feature 
selection  performance,  and  compared  two  strategies  for  sample  space  partitioning  for  classifier  design  and  testing.  Our  results 
indicated  that  the  resubstitution  estimate  was  always  optimistically  biased,  except  in  cases  where  too  few  features  were 
selected  by  the  stepwise  procedure.  When  feature  selection  was  performed  using  only  the  design  samples,  the  hold-out 
estimate  was  always  pessimistically  biased.  When  feature  selection  was  performed  using  the  entire  finite  sample  space,  and 
the  data  was  subsequently  partitioned  into  design  and  test  groups,  the  hold-out  estimates  could  be  pessimistically  or 
optimistically  biased,  depending  on  the  number  of  features  available  for  selection,  number  of  available  samples,  and  their 
statistical  distribution.  All  hold-out  estimates  exhibited  a  pessimistic  bias  when  the  parameters  of  the  simulation  were 
obtained  from  texture  features  extracted  from  mammograms  in  a  previous  study. 

Keywords:  feature  selection,  linear  discriminant  analysis,  effects  of  finite  sample  size,  computer-aided  diagnosis 


1.  INTRODUCTION 

A  common  problem  in  computer-aided  diagnosis  (CAD)  is  the  lack  of  a  large  number  of  image  samples  to  design  a 
classifier and to test its performance. The effect of finite sample size on the classification accuracy is therefore an important
research topic. In order to treat its specific components, previous studies have mostly ignored the feature selection
component of this problem, and assumed that the features used in the classifier were fixed.1-4 However, in many CAD
algorithms,  feature  selection  is  a  necessary  first  step.  This  paper  addresses  the  effect  of  finite  sample  size  on  classification 
accuracy  when  the  classifier  design  involves  feature  selection. 

In  classifier  design,  the  resubstitution  and  hold-out  estimates  are  commonly  used  to  assess  the  accuracy  of  the 
classifier.  To  obtain  the  resubstitution  estimate,  the  classifier  is  designed  using  a  number  of  training  samples,  and  the  same 
samples  are  then  applied  to  the  classifier  to  yield  the  distribution  of  the  output  decision  variable  for  the  training  group.  The 
resubstitution  performance  of  the  classifier  is  then  measured  (e.g.,  by  computing  the  area  under  the  receiver  operating 
characteristic  curve,  or  by  evaluating  the  probability  of  misclassification)  using  this  distribution.  To  obtain  the  hold-out 
estimate, the classifier is designed in a similar way, except that an independent set of test samples is applied to the classifier
to  yield  the  distribution  of  the  output  decision  variable  for  the  test  group.  As  the  number  of  training  samples  increases,  both 
of  these  estimates  approach  the  true  classification  accuracy,  which  is  the  accuracy  of  a  classifier  designed  with  the  full 
knowledge  of  the  sample  distributions.  When  the  training  sample  size  is  finite,  it  is  known  that,  on  average,  the  resubstitution 
estimate of classifier accuracy is optimistic. In other words, it has a higher expected value than the performance obtained
with  an  infinite  design  sample  set,  which  is  the  true  classification  accuracy.  Similarly,  on  average,  the  hold-out  estimate  is 
pessimistic.  When  classifier  design  is  limited  by  the  availability  of  design  samples,  it  is  important  to  obtain  a  conservative 
(or  pessimistic)  performance  estimate,  which  provides  a  lower  bound  on  the  classification  accuracy. 


In  CAD  literature,  different  methods  have  been  used  to  estimate  the  classifier  accuracy  when  the  classifier  design 
involves feature selection. In a few studies, only the resubstitution estimate was provided.5 In some studies, the researchers
partitioned  the  samples  into  training  and  test  groups  at  the  beginning  of  the  study,  performed  both  feature  selection  and 
classifier parameter estimation using the training set, and provided the hold-out performance estimate.6 Several other studies
used a mixture of the two methods: The entire sample space was used as the training set at the feature selection step of
classifier design, but once the features were chosen, the hold-out or leave-one-out methods were used to measure the
accuracy of the classifier.7-12 To our knowledge, it has not been reported whether this latter method provides an optimistic
or  pessimistic  estimate  of  the  classifier  performance. 

This  paper  describes  a  simulation  study  that  investigates  the  effect  of  finite  sample  size  on  classifier  accuracy  when 
classifier  design  involves  feature  selection.  We  chose  to  focus  our  attention  on  stepwise  feature  selection  in  linear 
discriminant  analysis  (stepwise  linear  discriminant  analysis)  since  this  is  a  simple  and  common  feature  selection  and 
classification  method.  The  class  distributions  were  assumed  to  be  multivariate  Gaussian.  We  studied  the  effect  of  different 
covariance  matrices  and  means  on  feature  selection  performance.  We  compared  the  bias  of  the  classifier  when  feature 
selection  was  performed  on  the  entire  sample  space,  and  on  the  design  samples  alone.  The  effects  of  sample  size,  number  of 
available  features,  and  parameters  of  stepwise  feature  selection  on  classifier  bias  were  examined. 

2.  METHODS 

To  evaluate  the  effect  of  sample  size  on  feature  selection  and  classifier  bias,  we  studied  the  problem  of  stepwise 
linear  discriminant  analysis  in  two  stages.  The  first  stage  is  stepwise  feature  selection,  and  the  second  stage  is  the  estimation 
of  linear  discriminant  coefficients  for  the  selected  feature  subset. 

2.1.  Stepwise  Feature  Selection 

Stepwise  feature  selection  iteratively  enters  features  into  or  removes  features  from  the  group  of  selected  features 
based  on  a  feature  selection  criterion.12  In  our  study,  we  used  Wilks’  lambda,  which  is  defined  as  the  ratio  of  within-group 
sum of squares to the total sum of squares of the discriminant scores, as the feature selection criterion. At the feature entry
step of the stepwise algorithm, an F value is computed for each feature based on the ratio of the Wilks' lambda before and
after the feature is entered into the pool of already selected features. The feature with the largest F value is entered into the
selected feature pool if the F value is larger than a threshold Fin. At the feature removal step, the features are tested for
removal one at a time from the selected feature pool, the F values are computed, and the feature with the smallest F value is
removed from the selected feature pool if the F value is smaller than a threshold Fout. The algorithm terminates when no
more  features  can  satisfy  the  criteria  for  either  entry  or  removal.  The  number  of  features  selected  therefore  increases,  in 
general,  when  Fin  or  Fout  are  reduced. 

2.2.  Estimation  of  Linear  Discriminant  Coefficients 

As  a  by-product  of  the  stepwise  feature  selection  procedure  used  in  our  study,  the  coefficients  of  a  linear  classifier 
that  classifies  its  design  samples  using  the  selected  features  are  also  computed.  However,  in  this  study,  the  design  samples 
used  in  the  stepwise  feature  selection  step  of  classifier  design  may  be  different  from  those  used  in  the  estimation  of  classifier 
coefficients.  Therefore,  we  implemented  the  stepwise  feature  selection  and  the  classifier  coefficient  estimation  components 
of  our  classification  scheme  separately. 

Let $\Sigma_1$ and $\Sigma_2$ denote the k-by-k covariance matrices of samples belonging to class 1 and class 2, and let
$\mu_1 = (\mu_1(1), \mu_1(2), \ldots, \mu_1(k))$ and $\mu_2 = (\mu_2(1), \mu_2(2), \ldots, \mu_2(k))$ denote their mean vectors. For an input vector $X$, the linear discriminant classifier output is
defined as

$$h(X) = (\mu_2 - \mu_1)^T \Sigma^{-1} X + \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\right), \tag{1}$$

where $\Sigma = (\Sigma_1 + \Sigma_2)/2$. The linear discriminant classifier is the optimal classifier when the two classes have a multivariate
Gaussian distribution with equal covariance matrices.

For the class separation measures considered in this paper (refer to Section 2.3), the constant term
$\tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\right)$ in Eq. (1) is irrelevant. Therefore, the classifier design can be viewed as the estimation of the k
parameters of the vector $(\mu_2 - \mu_1)^T \Sigma^{-1}$ using the design samples.
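
A minimal sketch of the coefficient estimation step is given below, assuming that the design samples of each class are stored as rows of a NumPy array. The function names `design_lda` and `lda_output` are hypothetical; the means and covariances are replaced by their sample estimates, and the averaged covariance matrix follows the definition of the pooled matrix given above.

```python
import numpy as np

def design_lda(X1, X2):
    """Estimate the weight vector and constant term of the linear discriminant."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    sigma = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
    sigma = np.atleast_2d(sigma)
    w = np.linalg.solve(sigma, mu2 - mu1)   # Sigma^{-1} (mu2 - mu1)
    b = 0.5 * (mu1 @ np.linalg.solve(sigma, mu1) - mu2 @ np.linalg.solve(sigma, mu2))
    return w, b

def lda_output(X, w, b):
    """Classifier output h(X) for one sample (1-D) or several samples (rows)."""
    return X @ w + b

# Toy design samples: class 2 is class 1 shifted by 3 along the first feature.
X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X2 = X1 + np.array([3.0, 0.0])
w, b = design_lda(X1, X2)
h_mid = lda_output(np.array([2.0, 0.5]), w, b)   # midpoint between the class means
```

With this sign convention the output is positive on the class 2 side of the decision boundary and zero at the midpoint between the two class means.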

When a finite number of design samples are available, the means and covariances are estimated as the sample means
and  the  sample  covariances  from  the  design  samples.  The  substitution  of  true  means  and  covariances  in  Eq.  (1)  by  their 
estimates  causes  a  bias  in  the  accuracy  of  the  classifier.  In  particular,  if  the  designed  classifier  is  used  for  the  classification  of 
design  samples,  then  the  performance  is  optimistically  biased,  and  if  the  classifier  is  used  for  classifying  test  samples  that  are 
independent  from  the  design  samples,  then  the  performance  is  pessimistically  biased. 

2.3.  Measures  of  Class  Separation 

2.3.1.  Infinite  sample  size 

When  an  infinite  sample  size  is  available,  the  class  means  and  covariance  matrices  can  be  estimated  without  bias 
(i.e., these quantities can be assumed to be known). In this case, we used the Mahalanobis distance $\Delta(\infty)$, or the area $A_z(\infty)$
under  the  receiver  operating  characteristic  (ROC)  curve  as  measures  of  classifier  accuracy.  The  infinity  sign  in  parentheses 
reflects  the  fact  that  the  distance  is  computed  using  the  true  means  and  covariance  matrices,  or,  equivalently,  using  an  infinite 
number  of  samples. 


Assume  that  the  two  classes  with  a  multivariate  Gaussian  distribution  with  equal  covariance  matrices  have  been 
classified  using  Eq.  (1).  Since  Eq.  (1)  is  a  linear  function  of  the  feature  vector  X,  the  classifier  outputs  for  class  1  and  class  2 
will be Gaussian. Let $m_1$ and $m_2$ denote the means of the classifier output for the normals and the abnormals, respectively, and let
$s_1^2$ and $s_2^2$ denote the variances for the two classes. With $\Delta(\infty)$ defined as

$$\Delta(\infty) = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1), \tag{2}$$

it can easily be shown that

$$m_2 - m_1 = s_1^2 = s_2^2 = \Delta(\infty). \tag{3}$$

The quantity $\Delta(\infty)$ is referred to as the Mahalanobis distance between the two classes. It is the squared Euclidean distance
between the two class means, normalized by the common covariance matrix.
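
The equality of the mean difference and the two variances can be written out in a few lines. The sketch below assumes equal class covariance matrices $\Sigma$ and introduces the weight vector $w$ as shorthand; $w$ is notation added here and does not appear in the original text.

```latex
\begin{align*}
w &= \Sigma^{-1}(\mu_2 - \mu_1), \qquad h(X) = w^T X + \text{const},\\
m_2 - m_1 &= w^T(\mu_2 - \mu_1)
           = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1) = \Delta(\infty),\\
s_1^2 = s_2^2 &= w^T \Sigma\, w
           = (\mu_2 - \mu_1)^T \Sigma^{-1} \Sigma\, \Sigma^{-1} (\mu_2 - \mu_1) = \Delta(\infty).
\end{align*}
```

The second line uses the symmetry of $\Sigma^{-1}$, and the third uses the fact that a linear function $w^T X$ of a random vector with covariance $\Sigma$ has variance $w^T \Sigma\, w$.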

In particular, if $\Sigma$ is a k-by-k diagonal matrix with $\Sigma_{i,i} = \sigma^2(i)$, then

$$\Delta(\infty) = \sum_{i=1}^{k} \delta(i), \tag{4}$$

where

$$\delta(i) = [\mu_2(i) - \mu_1(i)]^2 / \sigma^2(i) \tag{5}$$

is the squared signal-to-noise ratio of the difference of the means between the two classes for the ith feature.


Using Eq. (3), and the normality of the classifier outputs, it can be shown that14

$$A_z(\infty) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{\Delta(\infty)/2}} e^{-t^2/2}\, dt. \tag{6}$$


2.3.2.  Finite  sample  size 

When  a  finite  sample  size  is  available,  the  means  and  covariances  of  the  two  class  distributions  were  estimated  as 
the  sample  means  and  the  sample  covariances  using  the  training  samples,  and  the  classifier  outputs  for  the  training  and  test 
samples were computed using Eq. (1). The accuracy of the classifier was measured by receiver operating characteristic
(ROC) methodology.15,16 The discriminant scores for samples belonging to class 1 and class 2 were used as decision
variables in the LABROC1 program, which provided the ROC curve based on maximum likelihood estimation.

2.4.  Simulation  conditions 

For  our  simulations,  we  assumed  that  the  two  classes  have  a  multivariate  Gaussian  distribution  with  equal 
covariance  matrices,  and  different  means.  The  number  of  available  features  was  M=100.  We  generated  a  sample  size  of  Ns 
samples  from  each  class  using  a  random  number  generator.  The  sample  space  was  randomly  partitioned  into  Nt  training 
samples  and  Ns-Nt  test  samples  per  class.  For  a  given  sample  space,  we  used  several  different  values  for  Nt  in  order  to  study 
the  effect  of  the  design  sample  size  on  classification  accuracy.  In  order  to  reduce  the  variance  of  the  classification  accuracy 
estimate, a given sample space was independently partitioned 20 times into Nt training samples and Ns-Nt test samples per
class,  and  the  classification  accuracy  using  these  20  partitions  was  averaged.  The  procedure  described  above  was  referred  to 
as  an  experiment.  For  each  simulation  condition  described  below,  50  statistically  independent  experiments  were  performed, 
and  the  results  were  averaged. 

Two  methods  for  feature  selection  were  considered.  In  the  first  method,  the  entire  sample  space  was  used  for  feature 
selection.  In  other  words,  the  entire  sample  space  was  treated  as  a  training  set  at  the  feature  selection  step  of  classifier  design. 
Before  the  coefficient  estimation  step  of  classifier  design,  the  sample  space  was  partitioned  into  training  and  test  groups.  The 
training  group  was  used  for  classifier  coefficient  estimation,  and  the  resubstitution  and  hold-out  performances  were  estimated 
by  applying  the  training  and  test  groups  to  the  designed  classifier,  respectively.  In  the  second  method,  sample  set  partitioning 
was  performed  before  feature  selection.  In  other  words,  both  feature  selection  and  coefficient  estimation  were  performed 
only  on  the  training  set. 
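
The difference between the two methods can be illustrated with a small sketch. The assumptions are loud ones: the features are pure noise (so the true Az is 0.5), a simple univariate ranking stands in for stepwise selection, and the empirical area under the ROC curve replaces the LABROC1 fit. All function names are hypothetical.

```python
import numpy as np

def auc(pos, neg):
    """Empirical area under the ROC curve."""
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def select_top_k(x1, x2, k):
    """Rank features by a two-sample t-like statistic and keep the k largest.
    Stand-in for stepwise selection, used only for illustration."""
    se = np.sqrt(x1.var(axis=0, ddof=1) / len(x1) + x2.var(axis=0, ddof=1) / len(x2))
    t = np.abs(x2.mean(axis=0) - x1.mean(axis=0)) / se
    return np.argsort(t)[-k:]

def lda_weights(x1, x2):
    pooled = 0.5 * (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False))
    return np.linalg.solve(pooled, x2.mean(axis=0) - x1.mean(axis=0))

def holdout_az(x1, x2, nt, k, rng, select_on_all):
    p1, p2 = rng.permutation(len(x1)), rng.permutation(len(x2))
    tr1, te1 = x1[p1[:nt]], x1[p1[nt:]]
    tr2, te2 = x2[p2[:nt]], x2[p2[nt:]]
    if select_on_all:
        idx = select_top_k(x1, x2, k)      # method 1: entire sample space
    else:
        idx = select_top_k(tr1, tr2, k)    # method 2: training samples only
    w = lda_weights(tr1[:, idx], tr2[:, idx])
    return auc(te2[:, idx] @ w, te1[:, idx] @ w)

def compare_methods(seed=0, m=100, ns=100, nt=50, k=5, n_rep=20):
    """Mean hold-out Az of both methods on noise features (true Az = 0.5)."""
    rng = np.random.default_rng(seed)
    first, second = [], []
    for _ in range(n_rep):
        x1 = rng.standard_normal((ns, m))
        x2 = rng.standard_normal((ns, m))
        first.append(holdout_az(x1, x2, nt, k, rng, True))
        second.append(holdout_az(x1, x2, nt, k, rng, False))
    return float(np.mean(first)), float(np.mean(second))

az_method1, az_method2 = compare_methods()
```

In this sketch the first method is expected to report a hold-out area above 0.5 even though the features carry no information, because the test samples took part in feature selection.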


Case 1: Comparison of correlated and diagonal covariance matrices


Case 1.a

In this simulation condition, the 100×100 covariance matrix Σ was chosen to have a
and Δμ(i)=0.1732 for all i. Using (2), the Mahalanobis distance is computed as 3.0, and Az(∞)=0.89.


Case 1.b

The features in Case 1.a can be transformed into a set of uncorrelated features using a linear transformation, which is
called the orthogonalization transformation. The linear orthogonalization transformation is defined by the eigenvector matrix
of Σ, so that the covariance matrix after orthogonalization is diagonal. After the transformation, the new covariance matrix
turns out to be the identity matrix, and the new mean vector is

\[ \Delta\mu(i) = \begin{cases} 0.5477 & \text{if } i \text{ is a multiple of 10} \\ 0 & \text{otherwise.} \end{cases} \]

Since a linear transformation will not affect the separability of the two classes, the Mahalanobis distance is the same
as in Case 1.a, i.e., Δ(∞)=3.0.
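
The numbers quoted for Case 1 can be checked with a short sketch. It assumes that the reported Mahalanobis distance is the squared distance Δ = Δμ' Σ⁻¹ Δμ and that Az(∞) = Φ(√(Δ/2)) for two equal-covariance Gaussian classes, which is consistent with the values 3.0 and 0.89 in the text. The correlated covariance matrix below is a hypothetical stand-in, not the matrix used in Case 1.a, and the function names are illustrative.

```python
import math
import numpy as np

def az_infinite(delta_sq):
    """Az for two equal-covariance Gaussians, given the squared Mahalanobis distance.
    Uses Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) with x = sqrt(delta_sq / 2)."""
    return 0.5 * (1.0 + math.erf(math.sqrt(delta_sq) / 2.0))

def mahalanobis_sq(delta_mu, cov):
    return float(delta_mu @ np.linalg.solve(cov, delta_mu))

def orthogonalize(delta_mu, cov):
    """Transform built from the eigenvectors of cov, scaled so that the
    transformed covariance matrix is the identity."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    t = (eigvecs / np.sqrt(eigvals)).T      # row j = eigenvector j / sqrt(eigenvalue j)
    return t @ delta_mu, t @ cov @ t.T

# Case 1.b mean vector from the text: 0.5477 at every 10th feature, 0 elsewhere.
dmu_b = np.where(np.arange(1, 101) % 10 == 0, 0.5477, 0.0)
delta_b = mahalanobis_sq(dmu_b, np.eye(100))
az_b = az_infinite(delta_b)

# Hypothetical correlated covariance, used only to show the invariance.
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
cov = a @ a.T + 6.0 * np.eye(6)
dmu = rng.standard_normal(6)
dmu_w, cov_w = orthogonalize(dmu, cov)
```

The same helper gives az_infinite(2.4) close to 0.86, the value quoted for Case 2.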


Case  2:  Simulation  of  a  possible  condition  in  CAD 

In order to simulate covariance matrices and mean vectors that one may encounter in CAD, we used texture features
extracted from patient mammograms in a previous study, which aimed at classifying regions of interest (ROIs) on
mammograms as malignant or benign.7 Ten different spatial gray level dependence (SGLD) texture measures were extracted
from each ROI at five different distances and two directions. The number of available features was therefore M=100. The
transformations that were applied to the ROI before feature extraction, and the formal definition of SGLD features can be
found in the literature.7,17 The means and covariances for each class were estimated from a database of 249 mammograms.


Case 2.a

In this simulation condition, the two classes were assumed to have a multivariate Gaussian distribution with
Σ=(Σ1+Σ2)/2, where Σ1 and Σ2 were estimated from the feature samples for the malignant and benign classes. Since the
features have different scales, their variances can vary by as much as a factor of 10^6. Therefore, it is difficult to provide an
idea about how the covariance matrix is distributed without listing all the entries of the 100×100 matrix Σ. The correlation
matrix, which is normalized so that all diagonal entries are unity, is better suited for this purpose. The absolute value of the
correlation matrix is shown as an image in Fig. 1. In this image, small elements of the correlation matrix are displayed as
darker pixels, and the diagonal elements, which are unity, are displayed as brighter pixels. From Fig. 1, it is observed that
some of the features are highly correlated or anticorrelated. The Mahalanobis distance was computed as Δ(∞)=2.4, which
implied Az(∞)=0.86.
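
The pooling of the two class covariance matrices and the normalization to a correlation matrix can be sketched as below. The data are hypothetical random features with very different scales, not the SGLD features of the study, and the function names are illustrative.

```python
import numpy as np

def pooled_covariance(x1, x2):
    """Average of the two class covariance estimates, (S1 + S2) / 2."""
    return 0.5 * (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False))

def correlation_from_covariance(cov):
    """Normalize a covariance matrix so that all diagonal entries are unity."""
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

# Hypothetical samples: 4 features whose scales differ by orders of magnitude.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1000.0])
x_malignant = rng.standard_normal((120, 4)) * scales
x_benign = rng.standard_normal((130, 4)) * scales + 0.5 * scales
cov = pooled_covariance(x_malignant, x_benign)
corr_abs = np.abs(correlation_from_covariance(cov))   # quantity displayed as an image
```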

Case  2.b 

To  determine  the  performance  of  a  feature  space  with  equivalent  discrimination  potential,  but  independent  features, 
we performed an orthogonalization transformation on the SGLD feature space, as explained previously (Case 1.b).

3.  RESULTS 

Case 1:

Feature selection from the entire sample space

Figs. 2.a and 2.b plot the area Az under the ROC curve for the resubstitution and hold-out performance estimates
versus the inverse of the number of training samples per class, 1/Nt, for Case 1.a and Case 1.b, respectively (number of
samples per class Ns=100). The Fin value was varied between 0.5 and 1.5, and Fout was defined as Fout=max[(Fin-1),0]. Fig.
3 is equivalent to Fig. 2.a, except the number of samples per class was increased from Ns=100 to Ns=500 in this figure.

Case  2: 

Feature  selection  from  the  entire  sample  space 

The area Az under the ROC curve for the resubstitution and hold-out performance estimates is plotted versus 1/Nt in
Figs. 4.a and 4.b for Case 2.a and Case 2.b, respectively (Ns=100). The Fin value was varied between 0.5 and 3.0, and Fout
was defined as Fout=max[(Fin-1),0]. Fig. 5 is equivalent to Fig. 4.a, except the number of samples per class was increased
from Ns=100 to Ns=500 in this figure.

Feature  selection  from  training  samples  alone 

Case 2.a was used as an example. The area Az under the ROC curve versus 1/Nt is plotted for Ns=100 and Ns=500
in Figs. 6 and 7, respectively.


4.  DISCUSSION 

Fig.  2.b  demonstrates  the  potential  disadvantage  of  performing  feature  selection  using  the  entire  sample  space.  The 
best possible test performance with infinite sample size for Case 1 is Az(∞)=0.89. However, in Fig. 2.b, we observe that some
of the “hold-out” estimates were as high as 0.92. These estimates were higher than Az(∞) because the hold-out samples were
excluded  from  classifier  design  only  in  the  parameter  estimation  stage  of  the  design,  and  were  used  as  training  samples  in 
feature  selection.  When  feature  selection  is  performed  using  a  small  sample  size,  some  features  that  are  useless  for  the 
general  population  may  appear  to  be  useful  for  the  classification  of  the  small  number  of  samples  at  hand.  This  was 
previously  demonstrated  in  the  literature  by  comparing  the  probability  of  misclassification  based  on  either  a  finite  sample  set 
or  the  entire  population  subject  to  the  constraint  that  a  given  number  of  features  were  used  for  classification.^  In  our  study, 
given a small data set, the variance in Wilks' lambda estimates causes some feature combinations to appear more powerful
than  they  actually  are.  If  the  data  set  is  partitioned  into  training  and  test  groups  after  feature  selection,  these  feature 
combinations  may  provide  optimistic  hold-out  estimates. 

The observation made in the previous paragraph about feature selection using the entire sample space is not a
general rule, however. Figs. 2.a and 4.a show that one does not always run the risk of obtaining an optimistic bias in the
hold-out estimate when the feature selection is performed using the entire sample space. For Case 1, the best possible test
performance with an infinite sample size is Az(∞)=0.89, but the best hold-out estimate in Fig. 2.a is Az=0.82. Similarly, for
Case 2, the best possible test performance with infinite sample size is Az(∞)=0.86, but the best hold-out estimate in Fig. 4.a is
Az=0.84. The features in both Case 1.a and Case 2.a were correlated. Case 1.b and Case 2.b were obtained from Case 1.a and
Case 2.a by applying a linear orthogonalization transformation to the features so that they become uncorrelated. Figs. 2.b and
4.b show that after this transformation is applied, the hold-out estimates can be optimistically biased for small sample size
(Ns=100). This shows that performing a linear combination of features before stepwise feature selection can have a dramatic
influence on its performance. This result is somewhat surprising, because the stepwise procedure is known to select a set of
features whose linear combination can effectively separate the classes. However, the orthogonalization transformation in this
study is assumed to be known a priori (i.e., it is not deduced from the available finite sample size), and is applied to the entire
feature space of M features, whereas the stepwise procedure only produces combinations of a subset of these features.

Figs. 6 and 7 demonstrate that when feature selection is performed using the training set alone, the hold-out
performance estimate is pessimistically biased. This bias decreases as the number of training samples, Nt, is increased.

When Fin and Fout values were low, the resubstitution performance estimates were optimistically biased for all the
cases studied. Low Fin and Fout values imply that many features are selected using the stepwise procedure. From previous
studies, it is known that a larger number of features in classification leads to larger resubstitution bias.3 On the other hand,
when Fin and Fout values were very high, the number of selected features could be so low that the resubstitution estimate
would be pessimistically biased, as can be observed from Fig. 3 (Fin=1.5) and Fig. 4.a (Fin=3.0). In all of our simulations, for
a given number of training samples Nt, the resubstitution estimate increased monotonically as the number of selected features
was increased by decreasing Fin and Fout.

In contrast to the resubstitution estimate, the hold-out estimate for a given number of training samples did not
change monotonically as Fin and Fout were decreased. This can be observed from Fig. 2.a, where the hold-out estimate for
Fin=1.5 is larger than all other hold-out estimates with different Fin values for Nt=25 (1/Nt=0.04). However, for Nt=90
(1/Nt=0.011), the hold-out estimate for the same Fin value is no longer the largest. In Fig. 2.a, the feature selection was
performed  using  the  entire  sample  space.  A  similar  phenomenon  can  be  observed  in  Fig.  7,  where  the  feature  selection  is 
performed  using  the  training  samples  alone.  This  means  that  for  a  given  number  of  design  samples,  there  is  an  optimum 
value  for  Fin  and  Fout  (or  the  number  of  selected  features)  that  provides  the  highest  hold-out  estimate.  This  is  the  well-known 
peaking  phenomenon  described  in  the  literature,19  which  can  be  explained  as  follows.  For  a  given  number  of  training 
samples,  increasing  the  number  of  features  in  the  classification  has  two  opposing  effects  on  the  hold-out  performance.  On 
the  one  hand,  the  new  features  may  provide  some  new  information  about  the  two  classes,  which  tends  to  increase  the  hold-out 
performance.  On  the  other  hand,  the  same  features  increase  the  complexity  of  the  classifier,  which  tends  to  decrease  the 
hold-out  performance.  Depending  on  the  balance  between  how  much  new  information  the  new  features  provide  and  how 
much the complexity increases, the hold-out performance may increase or decrease when the number of features is increased.

In this study, the number of available features was fixed at M=100. The number of samples per class was Ns=100 in
most of the simulations. However, in three of our simulation conditions, we used Ns=500, which meant that the total number
of samples was ten times that of available features. The results of these simulations are shown in Fig. 3 for Case 1, and Figs.
5 and 7 for Case 2. Our first observation concerning these figures is that no hold-out estimates in any of these figures are
higher than their respective Az(∞) values. This suggests that optimistic hold-out estimates may be avoided by increasing the
number of available samples, or, possibly, by decreasing the number of features used for feature selection. A second
observation is that, compared to other figures in this study, the relationship between the Az values and 1/Nt is closer to a linear
relation. This suggests that it may be possible to obtain Az(∞) by fitting a line to the Az vs. 1/Nt curves using linear
regression, and finding the y-axis intercept. This is similar to the modified Fukunaga and Hayes technique that we discussed
previously in the studies of finite sample size effect on classifier bias.
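
The extrapolation suggested above amounts to a straight-line fit of Az against 1/Nt and reading off the intercept at 1/Nt=0. A minimal sketch follows; the Az values are hypothetical numbers generated from an exactly linear relation, not results from this study, and extrapolate_az is an illustrative name.

```python
import numpy as np

def extrapolate_az(nt_values, az_values):
    """Fit Az = intercept + slope * (1/Nt) by least squares.
    The intercept is the extrapolated Az at infinite training sample size."""
    inv_nt = 1.0 / np.asarray(nt_values, dtype=float)
    slope, intercept = np.polyfit(inv_nt, np.asarray(az_values, dtype=float), 1)
    return float(intercept), float(slope)

# Hypothetical hold-out estimates that follow Az = 0.86 - 2.0 / Nt exactly.
nt = [50, 100, 200, 400]
az = [0.86 - 2.0 / n for n in nt]
az_inf, slope = extrapolate_az(nt, az)
```

With real estimates the fit is only as good as the linearity of the curve, which the text reports to hold more closely for Ns=500 than for Ns=100.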

This  study  examined  only  the  bias  of  the  mean  performance  estimates,  which  were  obtained  by  averaging  the 
estimates  from  fifty  experiments  as  described  in  Section  2.4.  Another  important  issue  in  classifier  design  is  the  variance  of 
the  individual  estimates.  The  variance  provides  an  estimate  of  the  generalizability  of  the  classifier  performance  to  other 
design and test samples. We previously studied the variance of performance estimates when the classifier design included
the estimation of classifier coefficients, but excluded feature selection.20 The extension of our previous studies to include
feature selection is an important further research topic.

5.  CONCLUSION 

In  this  study,  we  investigated  the  finite-sample  performance  of  a  linear  classifier  that  included  stepwise  feature 
selection  as  a  design  step.  We  compared  the  resubstitution  and  hold-out  estimates  to  the  true  classification  accuracy,  which  is 
the  accuracy  of  a  classifier  designed  with  the  full  knowledge  of  the  sample  distributions.  We  compared  the  effect  of 
partitioning the data set into training and test groups before performing feature selection, and after performing feature
selection. When data partitioning was performed before feature selection, the hold-out estimate was always pessimistically
biased. When partitioning was performed after feature selection, i.e., the entire sample space was used for feature selection,
the  hold-out  estimates  could  be  pessimistically  or  optimistically  biased,  depending  on  the  number  of  features  available  for 
selection,  number  of  available  samples,  and  their  statistical  distribution.  All  hold-out  estimates  exhibited  a  pessimistic  bias 
when  the  parameters  of  the  simulation  were  obtained  from  correlated  texture  features  extracted  from  mammograms  in  our 
previous  study.  The  understanding  of  the  performance  of  the  classifier  designed  with  different  schemes  will  allow  us  to 
utilize a limited sample set efficiently and to avoid an overly optimistic assessment of the classifier

Fig. 1 The absolute value of the correlation matrix for the 100-dimensional texture feature space extracted from 249
mammograms. The covariance matrix corresponding to these features was used in simulation Case 2.a.

Fig. 2.a The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
1.a, feature selection from the entire sample space of 100 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.89.

Fig. 2.b The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
1.b, feature selection from the entire sample space of 100 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.89.

Fig. 3 The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
1.a, feature selection from the entire sample space of 500 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.89.

Fig. 4.a The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
2.a, feature selection from the entire sample space of 100 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.86.

Fig. 4.b The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
2.b, feature selection from the entire sample space of 100 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.86.

Fig. 5 The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
2.a, feature selection from the entire sample space of 500 samples/class. Feature selection was performed
using an input feature space of M=100 available features. Az(∞)=0.86.

Fig. 6 The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
2.a, feature selection from design samples alone (Ns=100). Feature selection was performed using an input
feature space of M=100 available features. Az(∞)=0.86.

Fig. 7 The area Az under the ROC curve versus the inverse of the number of design samples Nt per class for Case
2.a, feature selection from design samples alone (Ns=500). Feature selection was performed using an input
feature space of M=100 available features. Az(∞)=0.86.

ABSTRACT

A  hybrid  classifier  which  combines  an  unsupervised  adaptive  resonance  network  (ART2)  and  a  supervised  linear 
discriminant  classifier  (LDA)  was  developed  for  analysis  of  mammographic  masses.  Initially  the  ART2  network  separates  the 
masses  into  different  classes  based  on  the  similarity  of  the  input  feature  vectors.  The  resulting  classes  are  subsequently 
divided  into  two  groups:  (i)  classes  containing  only  malignant  masses  and  (ii)  classes  containing  both  malignant  and  benign 
or only benign masses. All masses belonging to the second group are used to formulate a single LDA model to classify them
as malignant and benign. In this approach, the ART2 network identifies the highly suspicious malignant cases and removes
them from the training set, thereby facilitating the formulation of the LDA model. In order to examine the utility of this
approach, a data set of 348 regions of interest (ROIs) containing biopsy-proven masses (169 benign and 179 malignant) was
used. Ten different partitions of training and test groups were randomly generated using 73% of ROIs for training and 27%
for testing. Classifier design including feature selection and weight optimization was performed with the training group. The
test group was kept independent of the training group. The performance of the hybrid classifier was compared to that of an
LDA classifier alone. Receiver Operating Characteristics (ROC) analysis was used to evaluate the accuracy of the classifier.
The average area under the ROC curve (Az) for the hybrid classifier was 0.81 as compared to 0.78 for LDA. The Az values
for  the  partial  areas  above  a  true  positive  fraction  of  0.9  were  0.34  and  0.27  for  the  hybrid  and  the  LDA  classifier, 
respectively.  These  results  indicate  that  the  hybrid  classifier  is  a  promising  approach  for  improving  the  accuracy  of 
classification  in  CAD  applications. 


1.  INTRODUCTION 

Mammography  is  the  most  effective  method  for  detection  of  early  breast  cancer1.  However,  the  specificity  for 
classification  of  malignant  and  benign  lesions  from  mammographic  images  is  relatively  low.  Clinical  studies  have  shown 
that  the  positive  predictive  value  (i.e.,  ratio  of  the  number  of  breast  cancers  found  to  the  total  number  of  biopsies)  is  only 
15% to 30% 2,3. It is important to increase the positive predictive value without reducing the sensitivity of breast cancer
detection. Computer-aided diagnosis (CAD) has the potential to increase the diagnostic accuracy by reducing the
false-negative rate while increasing the positive predictive values of mammographic abnormalities.

Classifier design is an important step in the development of a CAD system. A classifier has to be able to merge the
available input feature information and make a correct evaluation. Commonly used classifiers for CAD include linear
discriminants (LDA)4 and backpropagation neural networks (BPN)5 which have been shown to perform well in lesion
classification problems6-9. These classifiers are generally designed by supervised training. However, these types of
classifiers have limitations dealing with the nonlinearities in the data (in case of LDA) and in generalizability when a limited
number of training samples are available (especially BPN). Another classification approach is based on unsupervised
classifiers, which cluster the data into different classes based on the similarities in the properties of the input feature vectors.
Therefore, unsupervised classifiers can be used to analyze the similarities within the data. However, it is difficult to use them
as a discriminatory classifier16,17.

We propose here a hybrid unsupervised/supervised structure to improve classification performance. The design of
this structure was inspired by neural information processing principles such as self-organization, decentralization and
generalization. It combines the Adaptive Resonance Theory network (ART2)14,15 and the LDA classifier as a cascade system
(ART2LDA). The self-organizing unsupervised ART2 network automatically decomposes the input samples into classes
with different properties. The ART2 network performs better compared to conventional clustering techniques in terms of
learning speed and discriminatory resolution for the detection of rare events16,17. The supervised LDA then classifies the

2. ART2 UNSUPERVISED NEURAL NETWORK

Figure 1. Structure of the ART2 network.

The structure of the ART2 network is shown in Figure 1. It consists of two parts: the ART2 network and the learning
stage. Suppose that there are n input features x_i (i = 1, ..., n) and k classes in the ART2 network. When a new vector is
presented to the input of the ART2 network, an activation value p_j for class j is calculated as:

\[ p_j = \sum_{i=1}^{n} x_i w_{ij}, \quad j = 1, \ldots, k, \tag{1} \]

where w_ij is the connection weight between input i and class j. The activation value is a measure of the membership of the
input vector to class j. The higher the value p_j is, the better the input vector matches class j. The maximum
value p_r is selected from all p_j (j = 1, ..., k) to find the best class match.

Furthermore, in order to balance the contribution to the activation value from all feature components, the input
feature values applied to the ART2 system are scaled between zero and one17. This normalization will allow detection of
similar feature patterns even when the magnitudes of the input feature components are very different.

The learning stage of the ART2 system can influence the weights of the selected class or the complete ART2
network structure by adding a new class. An additional parameter, the vigilance, is used to determine the type of learning14.
The vigilance parameter p_vig is a threshold value that is compared to the maximum activation value p_r. If p_r is larger than p_vig,
then the input vector is considered to belong to class r. The adaptation of the weights connected with class r is performed as
follows:

\[ w_{ir}^{new} = w_{ir}^{old} + \eta \, (x_i - w_{ir}^{old}) \quad \text{for } i = 1, \ldots, n, \tag{2} \]

where η is a learning rate. The adaptation of the class r weights (Eq. 2) aims at maximization of the p_r value for the
particular input vector. In an iterative manner the weights are adjusted so that the produced activation values for similar input
vectors will be maximum only for the class to which they belong and these maximum activation values will be higher than
p_vig.

If the maximum activation value p_r is smaller than p_vig, it is an indication that a novelty has appeared and a new class
will be added to the ART2 structure. The new weights connecting the input with the new class (k+1) are initialized with the
scaled input feature values of this novelty. In this way the activation value p_k+1 will be maximum (p_r = p_k+1) and will be
higher than p_vig when it is computed for this novelty in further training iterations. The value of the vigilance parameter p_vig
determines the resolution of ART2. It can be chosen in the range between 0 and 1. If p_vig is relatively small, only very
different input feature vectors will be distinguished and separated in different classes. If p_vig is relatively large, the input
feature vectors that are more similar will be separated into different classes. The choice of p_vig depends on the particular
application.
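
The activation, vigilance test and weight adaptation described in this section can be sketched as a simplified clustering loop. This is an illustration, not the ART2 implementation used in the study. It adds one assumption that the text does not state: input vectors and class weights are rescaled to unit length, so that the activation of Eq. (1) is at most 1 and can be compared with a vigilance between 0 and 1. The name art2_like_cluster and the parameter defaults are hypothetical.

```python
import numpy as np

def art2_like_cluster(samples, p_vig=0.9, eta=0.5, n_epochs=3):
    """Simplified ART2-style clustering following Eqs. (1) and (2).
    Assumes nonzero, nonnegative feature vectors (features scaled to [0, 1])."""
    x = np.asarray(samples, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # assumption: unit length
    weights = [x[0].copy()]                # the first sample founds the first class
    assignments = np.zeros(len(x), dtype=int)
    for _ in range(n_epochs):
        for idx, vec in enumerate(x):
            activations = np.vstack(weights) @ vec     # Eq. (1): p_j = sum_i x_i w_ij
            r = int(np.argmax(activations))            # best-matching class
            if activations[r] > p_vig:
                # Eq. (2): move the class r weights toward the input vector
                updated = weights[r] + eta * (vec - weights[r])
                weights[r] = updated / np.linalg.norm(updated)
            else:
                # novelty: add a class initialized with the scaled input
                weights.append(vec.copy())
                r = len(weights) - 1
            assignments[idx] = r
    return assignments, np.vstack(weights)

# Two hypothetical groups of similar feature vectors.
data = [[1.0, 0.05], [1.0, 0.0], [0.95, 0.02],
        [0.0, 1.0], [0.05, 1.0], [0.02, 0.9]]
assignments, weights = art2_like_cluster(data)
```

Raising p_vig in this sketch splits the data into more classes, which mirrors the role of the vigilance parameter as the resolution of the network.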


3.  ART2LDA  CLASSIFIER 

Despite  the  good  performance  of  ART2  for  efficient  clustering  and  detection  of  novelties,  the  fast  learning  approach 
can  cause  problems  associated  with  the  generalization  capability  of  the  system  and  the  correct  classification  of  unknown 
cases.  Supervised  classifiers  such  as  linear  discriminants  or  backpropagation  neural  network  classifiers  can  have  better 
generalization  capability  than  ART2,  because  they  are  trained  by  averaging  over  similar  event  occurrences.  However,  these 
classifiers  do  not  have  the  ability  to  correctly  classify  rare  events. 

Figure 2. Structure of the ART2LDA classifier.

In order to improve the accuracy and generalization of a classifier, we propose to design a hybrid classifier that
combines the unsupervised ART2 network and a supervised LDA classifier. This hybrid classifier (ART2LDA) utilizes the
good resolution capability of ART2 and the good generalization capability of LDA. The ART2 network first analyzes the
similarity of the sample population and identifies a subpopulation that may be separated from the main population. This will
improve the performance of the second-stage LDA if the subpopulation causes the sample population to deviate from a
multivariate normal distribution for which LDA is an optimal classifier. Therefore, the ART2 serves as a screening tool to
improve the normality of the sample distribution by classifying outlying samples into separate classes.

The  structure  of  the  hybrid  ART2LDA  classifier  is  shown  in  Fig.  2.  The  classes  identified  by  ART2  are  labeled  to 
be  one  of  the  two  types:  malignant  class  or  mixed  class.  A  particular  class  is  defined  as  malignant  if  it  contains  only 
malignant  members.  It  is  defined  as  mixed  if  it  contains  both  malignant  and  benign  members.  The  type  of  a  given  class  is 
determined based on ART2 classification of the training data set. The ART2 classifies an input sample into either a
malignant or a mixed class. Depending on the class type it is determined whether the LDA classifier will be used. If an input
sample  is  classified  into  a  mixed  class,  the  final  classification  will  be  obtained  based  on  the  LDA  classifier,  which  has  been 
trained  by  the  mixed  classes  in  the  training  set.  However,  if  an  input  sample  is  classified  by  ART2  into  a  malignant  class 
then  the  mass  will  be  considered  malignant,  without  using  the  LDA  classifier.  Therefore,  in  the  ART2LDA  structure,  the 
ART2  is  used  both  as  a  classifier  and  a  supervisor. 
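
The routing logic of the cascade can be sketched as follows. The sketch takes the ART2 class assignments as given (a hypothetical array here), treats every class that is not purely malignant as mixed, and uses a midpoint threshold on the LDA score, which is an assumption made only to obtain a binary decision. All function names are illustrative.

```python
import numpy as np

def label_class_types(cluster_ids, is_malignant):
    """Label each ART2 class 'malignant' (only malignant members) or 'mixed'."""
    types = {}
    for c in np.unique(cluster_ids):
        members = is_malignant[cluster_ids == c]
        types[int(c)] = "malignant" if members.all() else "mixed"
    return types

def fit_lda(x, is_malignant):
    """Two-class linear discriminant: weight vector and midpoint threshold."""
    xb, xm = x[~is_malignant], x[is_malignant]
    pooled = 0.5 * (np.cov(xb, rowvar=False) + np.cov(xm, rowvar=False))
    w = np.linalg.solve(pooled, xm.mean(axis=0) - xb.mean(axis=0))
    threshold = 0.5 * (xm.mean(axis=0) + xb.mean(axis=0)) @ w
    return w, float(threshold)

def art2lda_predict(sample, cluster_id, class_types, lda):
    """Malignant-only class: call the mass malignant. Mixed class: ask the LDA."""
    if class_types.get(cluster_id, "mixed") == "malignant":
        return True
    w, threshold = lda
    return bool(sample @ w > threshold)

# Hypothetical training data: overlapping benign/malignant groups plus outliers.
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(0.0, 1.0, (40, 2)),    # benign
                     rng.normal(2.0, 1.0, (40, 2)),    # malignant, overlapping
                     rng.normal(8.0, 0.2, (5, 2))])    # malignant outliers
y_train = np.array([False] * 40 + [True] * 45)
clusters = np.array([0] * 80 + [1] * 5)                # hypothetical ART2 output
types = label_class_types(clusters, y_train)
in_mixed = np.array([types[int(c)] == "mixed" for c in clusters])
lda = fit_lda(x_train[in_mixed], y_train[in_mixed])    # LDA sees mixed classes only
```

Removing the outlying malignant class before fitting keeps the LDA training set closer to the two overlapping groups it is meant to separate.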


4.  MATERIALS  AND  METHODS 


4.1.  Data  set 

The  mammograms  used  in  this  study  were  randomly  selected  from  the  files  of  patients  who  had  undergone  biopsy  at 
the  University  of  Michigan.  The  criterion  for  inclusion  of  a  mammogram  in  the  data  set  was  that  the  mammogram  contained 
a biopsy-proven mass. Approximately equal numbers of malignant and benign masses were included. The data set contained
348 mammograms with a mixture of benign (n=169) and malignant (n=179) masses. The visibility of the masses was rated
by a radiologist experienced in breast imaging on a scale of 1 to 10, where the rating of 1 corresponds to the most visible
category. The distributions of the visibility rating for both the malignant and benign masses are shown in Fig. 3. The
visibility ranged from subtle to obvious for both types of masses. It can be observed that the benign masses tend to be more
obvious than the malignant ones. Additionally the likelihood of malignancy for each mass was estimated based on its
mammographic appearance. The radiologist rated the likelihood of malignancy on a scale of 1 to 10, where 1 indicated a
mass with the most benign appearance. The distribution of the malignancy rating of the masses is shown in Fig. 4.

Figure 3. The distribution of the visibility ranking of the masses in the dataset. The ranking was performed by an
experienced radiologist. (1: very obvious, 10: very subtle).

Figure 4. The distribution of the malignancy ranking of the masses in the dataset. The ranking was performed by an
experienced radiologist. (1: very likely benign, 10: very likely malignant).

Three hundred and five of the mammograms were digitized with a LUMISYS DIS-1000 laser scanner at a pixel
resolution of 100 µm × 100 µm and 4096 gray levels. The digitizer was calibrated so that gray level values were linearly
and inversely proportional to the optical density (OD) within the range of 0.1 to 2.8 OD units, with a slope of -0.001
OD/pixel value. Outside this range, the slope of the calibration curve decreased gradually. The OD range of the digitizer was
0 to 3.5. The remaining 43 mammograms were digitized with a LUMISCAN 85 laser scanner at a pixel resolution of 50 µm
× 50 µm and 4096 gray levels. The digitizer was calibrated so that gray level values were linearly and inversely proportional
to the OD within the range of 0 to 4 OD units, with a slope of -0.001 OD/pixel value. In order to process the mammograms
digitized with these two different digitizers, the images digitized with the LUMISCAN 85 digitizer were convolved with a 2×2
box filter and subsampled by a factor of two, resulting in 100 µm images.
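The resolution matching described above can be illustrated with a minimal sketch. It assumes that the 2×2 box filter is a plain average and that the subsampling keeps one value per non-overlapping 2×2 block; the sampling phase and the function name box_filter_and_subsample are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def box_filter_and_subsample(image):
    """Average each non-overlapping 2x2 block, halving the sampling rate."""
    img = np.asarray(image, dtype=np.float64)
    rows, cols = img.shape
    # Drop a trailing row/column if the size is odd so the blocks tile exactly.
    img = img[: rows - rows % 2, : cols - cols % 2]
    blocks = img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2)
    return blocks.mean(axis=(1, 3))

# Toy stand-in for a 50 µm image; the result has half the rows and columns.
fine = np.arange(16).reshape(4, 4)
coarse = box_filter_and_subsample(fine)
print(coarse)
```

Averaging before subsampling limits aliasing, which is presumably why a filter is applied rather than simply discarding every second pixel.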

In  order  to  validate  the  prediction  abilities  of  the  classifier,  the  data  set  was  partitioned  randomly  into  training  and 
test  subsets.  Approximately  73%  of  the  samples  have  been  used  for  training  and  27%  for  testing.  The  data  set  was 
repartitioned  randomly  ten  times  and  the  training  and  test  results  were  averaged  to  reduce  their  variability. 
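A minimal sketch of the repeated random partitioning is given below. The text does not state whether the split was stratified by class or grouped by patient, so this sketch simply shuffles sample indices; the function name, the seed, and the rounding of the training fraction are assumptions.

```python
import numpy as np

def random_partitions(n_samples, train_fraction=0.73, n_repeats=10, seed=0):
    """Return n_repeats (train_indices, test_indices) pairs drawn at random."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_samples))
    splits = []
    for _ in range(n_repeats):
        order = rng.permutation(n_samples)
        # The first n_train shuffled indices train the classifier, the rest test it.
        splits.append((np.sort(order[:n_train]), np.sort(order[n_train:])))
    return splits

splits = random_partitions(348)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 348 samples and a 73% training fraction, each partition in this sketch has 254 training and 94 test samples; the performance measure would then be averaged over the ten partitions.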


4.2.  Feature  extraction 

The  texture  features  used  in  this  study  were  calculated  from  spatial  grey-level  dependence  (SGLD)  matrices6''’18  and 
run-length  statistics  (RLS)  matrices19.  The  SGLD  and  RLS  matrices  were  computed  from  the  images  obtained  by  the  rubber 
band  straightening  transform  (RBST)8.  The  RBST  maps  a  band  of  pixels  surrounding  the  mass  onto  the  Cartesian  plane  (a 
rectangular  region).  In  the  transformed  image,  the  mass  border  appears  approximately  as  a  horizontal  edge,  and  spiculations 
appear  approximately  as  vertical  lines.  A  complete  description  of  the  RBST  can  be  found  in  the  literature8. 

The (i,j)th element of the SGLD matrix is the joint probability that gray levels i and j occur in a direction θ at a
distance of d pixels apart in an image. Based on our previous studies6, a bit depth of eight was used in the SGLD matrix
construction, i.e., the four least significant bits of the 12 bit pixel values were discarded. Thirteen texture measures including
correlation, energy, difference entropy, inverse difference moment, entropy, sum average, sum entropy, inertia, sum variance,
difference average, difference variance and two types of information measure of correlation were used. These measures were
extracted from each SGLD matrix at ten different pixel pair distances (d=1, 2, 3, 4, 6, 8, 10, 12, 16 and 20) and in four
directions (0°, 45°, 90°, and 135°). Therefore, a total of 13 × 10 × 4 = 520 SGLD features were calculated for each image. The
definitions  of  the  texture  measures  are  given  in  the  literature6"8’18.  These  features  contain  information  about  image 
characteristics  such  as  homogeneity,  contrast,  and  the  complexity  of  the  image. 
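The construction of an SGLD matrix can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the matrix is made symmetric (each pixel pair is counted in both orders), the direction-to-offset mapping in OFFSETS is a common convention, and only two of the thirteen measures (energy and inertia) are shown, using their usual definitions.

```python
import numpy as np

# Assumed (row, col) unit steps for the four directions.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def quantize_12_to_8_bits(image12):
    """Discard the four least significant bits of 12-bit pixel values."""
    return np.right_shift(np.asarray(image12, dtype=np.int64), 4)

def sgld_matrix(image, d, angle_deg, levels=256):
    """Joint probability of gray levels (i, j) at distance d along angle_deg."""
    img = np.asarray(image, dtype=np.int64)
    dr, dc = (d * step for step in OFFSETS[angle_deg])
    rows, cols = img.shape
    # Region in which both pixels of a pair fall inside the image.
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    first = img[r0:r1, c0:c1]
    second = img[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (first.ravel(), second.ravel()), 1.0)
    counts += counts.T  # symmetric matrix (assumption)
    return counts / counts.sum()

def energy(p):
    return float((p ** 2).sum())

def inertia(p):
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())

toy = np.array([[0, 0, 1], [1, 1, 0]])
p = sgld_matrix(toy, d=1, angle_deg=0, levels=2)
print(p, energy(p), inertia(p))
```

Repeating such a computation for thirteen measures, ten distances, and four directions gives the 520 SGLD features per image mentioned above.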

RLS  texture  features  were  extracted  from  the  vertical  and  horizontal  gradient  magnitude  images,  which  were 
obtained  by  filtering  the  RBST  image  with  horizontally  or  vertically  oriented  Sobel  filters  and  computing  the  absolute 
gradient  value  of  the  filtered  image.  A  gray  level  run  is  a  set  of  consecutive,  collinear  pixels  in  a  given  direction  which  have 
the  same  gray  level  value.  The  run  length  is  the  number  of  pixels  in  a  run19.  The  RLS  matrix  describes  the  run  length 
statistics  for  each  gray  level  in  the  image.  The  (i,j)th  element  of  the  RLS  matrix  is  the  number  of  times  that  the  gray  level  i 
in the image possesses a run length of j in a given direction. In our previous study, it was found experimentally that a bit
depth  of  5  in  the  RLS  matrix  computation  could  provide  good  texture  characteristics8. 
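The definition of the RLS matrix can be made concrete with a short sketch. It assumes that the image has already been quantized to integer gray levels in the range 0 to levels-1; the function name and the convention that column j-1 stores runs of length j are choices made for this illustration.

```python
import numpy as np

def run_length_matrix(image, levels, horizontal=True):
    """Element (i, j-1) counts the runs of gray level i that have length j."""
    img = np.asarray(image, dtype=np.int64)
    if not horizontal:
        img = img.T  # vertical runs are horizontal runs of the transpose
    rls = np.zeros((levels, img.shape[1]), dtype=np.int64)
    for row in img:
        run_value, run_len = row[0], 1
        for value in row[1:]:
            if value == run_value:
                run_len += 1
            else:
                rls[run_value, run_len - 1] += 1  # close the finished run
                run_value, run_len = value, 1
        rls[run_value, run_len - 1] += 1  # close the last run of the row
    return rls

toy = np.array([[0, 0, 1],
                [1, 1, 1]])
rls = run_length_matrix(toy, levels=2)
print(rls)
```

In this toy image there is one run of gray level 0 with length 2, and two runs of gray level 1 with lengths 1 and 3.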

Five texture measures, namely, short run emphasis, long run emphasis, gray level nonuniformity, run length
nonuniformity, and run percentage were extracted from the vertical and horizontal gradient images in two directions, θ =
0° and θ = 90°. Therefore, a total of 5 × 2 × 2 = 20 RLS features were calculated for each ROI.
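The five run-length measures named above can be sketched from an RLS matrix as follows. The formulas are the commonly used run-length definitions and are assumed here; the text itself defers the definitions to the literature, so the normalizations used in the study may differ.

```python
import numpy as np

def rls_measures(rls):
    """Five run-length measures from a matrix whose column j-1 holds runs of length j."""
    rls = np.asarray(rls, dtype=np.float64)
    lengths = np.arange(1, rls.shape[1] + 1)
    n_runs = rls.sum()
    n_pixels = (rls * lengths).sum()
    return {
        "short_run_emphasis": (rls / lengths ** 2).sum() / n_runs,
        "long_run_emphasis": (rls * lengths ** 2).sum() / n_runs,
        "gray_level_nonuniformity": (rls.sum(axis=1) ** 2).sum() / n_runs,
        "run_length_nonuniformity": (rls.sum(axis=0) ** 2).sum() / n_runs,
        "run_percentage": n_runs / n_pixels,
    }

# One run of level 0 with length 2; runs of level 1 with lengths 1 and 3.
example = np.array([[0, 1, 0],
                    [1, 0, 1]])
measures = rls_measures(example)
print(measures)
```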

A  total  of  540  features  (520  SGLD  and  20  RLS)  were  therefore  extracted  from  each  ROI. 

4.3.  Feature  selection


In order to reduce the number of features and to obtain a feature set suitable for designing a good classifier, feature
selection with stepwise linear discriminant analysis20 was applied. At each step of the stepwise selection procedure, one
feature is entered into or removed from the feature pool by analyzing its effect on the selection criterion. In this study, Wilks'
lambda was used as the selection criterion.
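The role of Wilks' lambda as a selection criterion can be illustrated with a small sketch. Wilks' lambda is the ratio of the determinant of the within-group sum-of-squares matrix to that of the total sum-of-squares matrix, so smaller values indicate better class separation. The sketch shows a single forward step only; the entry and removal thresholds of a full stepwise procedure are omitted, and the function names and toy data are hypothetical.

```python
import numpy as np

def wilks_lambda(X, y):
    """det(within-group scatter) / det(total scatter) for feature matrix X."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for label in np.unique(y):
        group = X[y == label]
        dev = group - group.mean(axis=0)
        within += dev.T @ dev
    return float(np.linalg.det(within) / np.linalg.det(total))

def best_feature_to_add(X, y, selected):
    """Forward step: the unselected feature that minimizes Wilks' lambda."""
    X = np.asarray(X, dtype=np.float64)
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    return min(candidates,
               key=lambda f: wilks_lambda(X[:, list(selected) + [f]], y))

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 1.0], [0.1, -1.0], [-0.1, 0.5],
              [5.0, 0.9], [5.1, -1.1], [4.9, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_feature_to_add(X, y, selected=[]))
```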


4.4.  Performance  analysis 

To  evaluate  the  classifier  performance,  the  training  and  test  discriminant  scores  were  analyzed  using  receiver 
operating  characteristic  (ROC)  methodology.  The  discriminant  scores  of  the  malignant  and  benign  masses  were  used  as 
decision  variables  in  the  LABROC1  program21,  which  fit  a  binormal  ROC  curve  based  on  maximum  likelihood  estimation. 
The classification accuracy was evaluated as the area under the ROC curve, Az. The discriminant scores of all case samples
classified in the two stages of ART2LDA were combined. All masses classified into the malignant group by the ART2 stage
were assigned a constant positive discriminant score higher than or equal to the most malignant discriminant score obtained
from the LDA classifier.
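The score combination described above can be sketched in a few lines. The sketch assumes that a higher discriminant score means a more malignant appearance and that the constant is the maximum LDA score plus an optional margin; the function and argument names are hypothetical.

```python
import numpy as np

def combine_stage_scores(lda_scores, art2_malignant, margin=0.0):
    """Merge LDA scores with a constant score for ART2-classified malignant cases.

    art2_malignant: boolean mask over all cases (True = classified by ART2).
    lda_scores: scores of the remaining cases, in their original order.
    """
    art2_malignant = np.asarray(art2_malignant, dtype=bool)
    lda_scores = np.asarray(lda_scores, dtype=np.float64)
    combined = np.empty(art2_malignant.size, dtype=np.float64)
    combined[~art2_malignant] = lda_scores
    # Constant score at least as high as the most malignant LDA score.
    combined[art2_malignant] = lda_scores.max() + margin
    return combined

scores = combine_stage_scores([-1.2, 0.7], [False, True, False])
print(scores)
```

The combined scores can then be passed to a single ROC analysis, so that both stages contribute to one Az value.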

The performance of ART2LDA was also assessed by estimation of the partial area under the ROC curve (Az(0.9)) at a
true positive fraction (TPF) higher than 0.9. The partial Az(0.9) indicates the performance of the classifier in the high
sensitivity (low false negative) region, which is most important for cancer detection in clinical practice.
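A numerical illustration of the partial area above a TPF of 0.9 is sketched below for a binormal ROC curve, TPF = Φ(a + b Φ⁻¹(FPF)). Because the partial areas reported in this study exceed 0.1, which is the largest possible unnormalized area above TPF = 0.9, the sketch assumes that the area is normalized by (1 - 0.9). The parameters a and b, the function name, and the numerical integration are illustrative assumptions and do not reproduce the LABROC1 computation.

```python
import numpy as np
from statistics import NormalDist

def partial_area_index(a, b, tpf0=0.9, n_points=20001):
    """Normalized area between a binormal ROC curve and the line TPF = tpf0."""
    nd = NormalDist()
    fpf = np.linspace(1e-9, 1.0 - 1e-9, n_points)
    tpf = np.array([nd.cdf(a + b * nd.inv_cdf(p)) for p in fpf])
    # Only the part of the curve above tpf0 contributes to the partial area.
    height = np.clip(tpf - tpf0, 0.0, None)
    area = np.sum((height[1:] + height[:-1]) / 2.0 * np.diff(fpf))
    return float(area / (1.0 - tpf0))

chance = partial_area_index(a=0.0, b=1.0)      # chance line, TPF = FPF
separated = partial_area_index(a=20.0, b=1.0)  # nearly perfect separation
print(chance, separated)
```

Under this normalization the index ranges from about 0.05 for a chance-level classifier to 1 for a perfect one.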


5.  RESULTS 

In  this  study,  the  test  subset  was  kept  truly  independent  from  the  training  subset;  only  the  training  subset  was  used 
for  feature  selection  and  classifier  training,  and  only  the  test  subset  was  used  for  classifier  validation.  In  order  to  validate  the 
prediction  abilities  of  the  classifier,  ten  different  partitions  of  the  training  and  test  sets  were  used  and  the  average 
classification  results  were  estimated. 


Table 1. Number of selected features for the 10 data groups.

Data group no.                1    2    3    4    5    6    7    8    9    10   Mean
Number of selected features   12   15   13   18   14   14   13   18   14   14   14

For  a  given  partition  of  training  and  test  sets,  feature  selection  was  performed  based  on  the  training  set.  The  feature 
selection  results  for  the  ten  different  training  groups  are  shown  in  Table  1.  The  average  number  of  selected  features  was  14. 
The  selected  feature  sets  contained  an  average  of  two  RLS  features  and  twelve  SGLD  features.  A  different  ART2LDA 
classifier  was  trained  using  each  training  set  and  the  corresponding  set  of  selected  features. 


5.1.  ART2LDA  classification  results 

For the ART2LDA classifier, the number of selected features determines the dimensionality of the input vector of
the ART2 classifier and the dimensionality of the LDA classifier. By using different values for the vigilance parameter,
ART2 classifiers with different numbers of classes were obtained. In this study, the vigilance parameter ρvig was varied from
0.9 to 0.99, resulting in a range of 10 to 240 classes. The overall performance of the ART2LDA classifier was evaluated for
different numbers of ART2 classes because different subsets of the samples were separated and classified by ART2. In Fig. 5,
the classification results for the ART2LDA are compared to the results from LDA alone for the training and test set partition
no. 3. The classification accuracy, Az, was plotted as a function of the number of ART2 classes. For this training and test set
partition, when the number of classes was between 20 and 60, the ART2LDA classifier improved the classification accuracy
for the test set in comparison to LDA. As the number of classes increased to greater than 60, the Az value increased for the
training data set, but decreased for the test data set and was lower than that of the LDA alone.

In Table 2 the Az values of the test set for the 10 corresponding partitions are shown. The average test Az value is
0.81 for the ART2LDA and 0.78 for LDA alone. For nine of the ten partitions, the Az value was improved by the hybrid
classifier.


Figure 5. ART2LDA and LDA classification results for training and test sets from data group No. 3 as a function of
the number of classes generated by ART2. (Axis label: Number of Classes.)

The performance of ART2LDA was also assessed by estimation of the partial area under the ROC curve (Az(0.9)) at a
TPF higher than 0.9. In Table 3 the Az(0.9) values of the test set for the 10 partitions of training and test sets are presented.
The average test Az(0.9) value is 0.34 for the ART2LDA and 0.27 for LDA. For nine of the ten partitions, the Az(0.9) value was
improved at the high sensitivity operating region (TPF>0.9) of the ROC curve.


Table 2. Classifier performance for the 10 test sets. The Az values represent the total area under the ROC curve.

Data group no.   LDA    ART2LDA
1                0.77   0.83
2                0.78   0.80
3                0.74   0.78
4                0.77   0.77
5                0.77   0.78
6                0.80   0.83
7                0.80   0.81
8                0.77   0.80
9                0.77   0.80
10               0.86   0.89
Mean             0.78   0.81

Table 3. Classifier results for the 10 test sets. The Az values represent the partial area of the ROC curve above
the true positive fraction of 0.9 (Az(0.9)).

Data group no.   LDA    ART2LDA
1                0.14   0.23
2                0.17   0.21
3                0.19   0.32
4                0.19   0.21
5                0.24   0.26
6                0.27   0.38
7                0.32   0.31
8                0.32   0.34
9                0.40   0.49
10               0.44   0.60
Mean             0.27   0.34


6.  DISCUSSION


In this paper a new classifier (ART2LDA) is designed and applied to the classification of malignant and benign
masses.  The  results  indicate  that  the  ART2LDA  classifier  has  better  generalizability  than  an  LDA  classifier  alone.  The 
ART2  classifier  groups  the  case  samples  that  are  different  from  the  main  population  into  separate  classes.  The  minimum 
number  of  classes  needed  to  start  the  clustering  of  outliers  into  separate  classes  depends  on  how  different  the  outliers  are 
from  the  rest  of  the  sample  population.  For  the  ten  different  partitions  of  the  training  and  test  sets  used  in  this  study,  the 
minimum  number  varied  between  13  and  15  classes.  When  the  number  of  ART2  classes  was  less  than  this  minimum  number 
of  classes,  the  ART2  classifier  generated  only  mixed  malignant-benign  classes  and  all  samples  were  transferred  to  the  LDA 
stage.  In  that  case,  the  ART2LDA  was  equivalent  to  the  LDA  classifier  alone.  When  a  higher  number  of  classes  was 
generated,  an  increased  number  of  cases  that  may  be  considered  outliers  of  the  general  data  population  was  removed 
(clustered  in  separate  classes).  For  the  ten  training  sets  used  in  this  study,  the  malignant  outliers  were  gradually  removed 
when the number of classes increased. The training accuracy increased when the number of classes increased and Az could
reach  the  value  of  1.0.  However,  a  large  number  of  ART2  classes  led  to  overfitting  the  training  sample  set  and  poor 
generalization  in  the  test  set.  The  classification  accuracy  of  ART2  for  the  test  set  tended  to  decrease  when  the  number  of 
classes was greater than about 70. The large number of classes also led to a reduction in the generalizability of the
second-stage LDA; the training of LDA with a small number of samples would again result in overfitting the training set, and poor
generalizability in the test set. This effect was observed when more than 60 or 70 classes were generated by ART2 (see Fig. 5).

The classification accuracy of ART2LDA increased initially with an increasing number of classes and then decreased
after  reaching  a  maximum.  The  correct  classification  of  the  outliers  by  the  ART2  in  combination  with  an  improvement  in  the 
classification  by  the  LDA  resulted  in  the  increased  accuracy.  When  the  number  of  ART2  classes  was  further  increased,  the 
effects  of  overfitting  by  the  ART2  and  the  LDA  became  dominant  and  the  prediction  ability  of  the  ART2LDA  decreased.  In 
some  cases  the  second  stage  LDA  prediction  was  much  worse  than  the  ART2.  In  other  cases  the  ART2  could  not  generalize 
well. The generation of a high number of classes is therefore impractical and unnecessary from both the computational and
the methodological points of view.

When the partial area of the ROC curve above the true positive (TP) fraction of 0.9 (Az(0.9)) was considered as a
measure of classification accuracy, the advantage of ART2LDA over LDA alone became even more evident. By removing
and correctly classifying the outliers, the accuracy of the classification is increased at the high sensitivity end of the curve.

We have performed statistical tests with the CLABROC program to estimate the significance of the differences
between the Az values from the ART2LDA and the LDA alone, as well as of the differences in the partial Az(0.9) from the two
classifiers. The statistical tests were performed for each individual data set partition because the correlation among the data
sets from the different partitions precludes the use of Student's paired t-test with the ten partitions. We found that the
differences in both cases did not reach statistical significance because of the small number of test samples and thus the large
standard deviation in the Az values. However, the consistent improvements in Az and Az(0.9) (9 out of 10 data set partitions in
both cases) suggest that the improvement was not by chance alone, and that the accuracy of a classification task could be
improved by the use of an ART2 network.

An  important  difference  between  the  classifier  designed  in  this  study  and  many  others  in  the  CAD  field  is  the 
method of feature selection. In several previously published studies8,22,23 the features were selected from the entire data set
first, and then the data set was partitioned into training and test sets. This meant that at the feature selection stage of the
classifier  design,  the  entire  data  set  was  considered  to  be  a  training  set.  Depending  on  the  distribution  of  the  features  and  the 
total  number  of  samples  used,  the  test  results  in  these  studies  might  be  optimistically  biased24.  In  this  study,  initially  the 
entire  data  set  was  partitioned  into  training  and  test  sets  and  then  feature  selection  was  performed  only  on  the  training  set. 
This  method  results  in  a  pessimistic  estimate  of  the  classifier  performance24  when  the  training  set  is  small.  We  therefore 
expect  that  the  performance  will  be  improved  when  the  classifier  designed  in  this  study  is  trained  using  a  large  data  set. 
Since  our  main  purpose  in  this  study  was  to  compare  the  LDA  and  ART2LDA  classifiers,  we  did  not  attempt  to  quantify  how 
pessimistic  our  results  are  in  this  study. 


7.  CONCLUSION 

A new classifier combining an unsupervised ART2 and a supervised LDA has been designed and applied to the
classification of malignant and benign masses. A data set consisting of 348 films (179 malignant and 169 benign) was
randomly partitioned into training and test subsets. Ten different random partitions were generated. For each training set,
texture  features  were  extracted  and  feature  selection  was  performed.  An  average  of  fourteen  features  were  selected  for  each 
group.  Ten  hybrid  ART2LDA  classifiers  and  ten  LDA  models  alone  were  trained  by  using  the  ten  training  sets.  The  average 
Az  value  under  the  ROC  curve  for  the  test  sets  was  better  for  ART2LDA  (Az=0.81)  compared  to  the  LDA  alone  (Az=0.78).  A 
greater  improvement  was  obtained  when  the  partial  ROC  area  above  a  true-positive  fraction  of  0.9  was  considered.  The 
average partial Az for ART2LDA was 0.34 as compared to 0.27 for LDA. These results indicate that the hybrid classifier is a
promising  approach  for  improving  the  accuracy  of  classifiers  for  CAD  applications. 


Abstract

We  have  investigated  the  use  of  an  active  contour  model  for  accurate  delineation  of  mass 
boundaries  on  mammograms.  The  model  used  smoothness  constraints  and  image  gradient 
information  in  order  to  refine  an  initial  boundary  provided  by  a  clustering  algorithm.  After 
segmentation  of  the  mass,  possible  spiculations  were  segmented  by  utilizing  gradient  direction 
statistics  in  a  region  surrounding  the  mass.  Spiculation  measures  and  morphological  features 
were  extracted  and  used  for  classifying  the  mass  as  malignant  or  benign.  The  classification 
accuracy  was  evaluated  using  the  area  Az  under  the  receiver  operating  characteristic  (ROC) 
curve. A data set containing 243 mammograms from 101 patients was used for training the
classifier, and a data set containing 95 mammograms from 45 patients was used for testing the
classifier. The test Az for the task of classifying a mass on a single view and a mass on all
available  views  as  malignant  or  benign  was  0.81  and  0.87,  respectively.  Our  results  indicate  that 
the  spiculation  measures  and  the  morphological  features  extracted  from  automatically  segmented 
mass  boundaries  are  effective  in  characterizing  mammographic  masses  as  malignant  or  benign. 

1.  Introduction 

In  recent  years,  many  researchers  have  investigated  the  use  of  computer-extracted  image  features 
for  classification  of  breast  masses  as  malignant  or  benign  (Sahiner  et  al.  1998;  Huo  et  al.  1998; 
Leichter  et  al.  2000).  Many  features  used  in  computerized  breast  mass  characterization  require 
accurate delineation of mass boundaries as a first step. Accurate computerized delineation of
mass boundaries is often difficult because of the presence of ill-defined or obscured boundaries.
The  human  visual  system  often  overcomes  this  problem  by  incorporating  a-priori  information, 
such  as  smoothness  of  mass  boundaries,  with  the  image  information.  In  order  to  make  use  of 
similar  information  for  computerized  mass  segmentation,  we  designed  an  active  contour  model 
based  on  the  image  characteristics  of  mammographic  masses.  The  new  model  was  used  to 
improve  the  boundaries  provided  by  a  clustering  algorithm  that  was  developed  in  our  earlier 
studies.  After  segmentation,  morphological  features  were  extracted  from  the  mass  shape,  and 
were  combined  with  spiculation  measures  for  the  characterization  of  breast  masses  as  malignant 
or  benign. 

2.  Mass  segmentation 

The  location  of  the  biopsied  mass  was  identified  by  an  MQSA-approved  radiologist.  A  region  of 
interest  (ROI)  containing  the  biopsied  mass  was  extracted  from  the  mammogram  for 
computerized  processing. 

2.1.  Initial  mass  segmentation 

The  mass  segmentation  method  employed  in  this  study  started  with  the  initial  detection  of  a  mass 
shape  within  an  ROI  using  a  K-means  clustering  algorithm.  This  technique  has  been  discussed 
in detail in the literature (Sahiner et al. 1996). Figures 1(a)-(d) show examples of a spiculated
and  a  nonspiculated  mass,  and  the  results  of  the  initial  segmentation. 

2.2.  Active  contour  segmentation 

Although  clustering-based  mass  segmentation  resulted  in  reasonable  mass  shapes  for  most  of  the 
masses,  the  segmentation  exhibited  inaccuracies  when  the  mass  was  not  very  conspicuous,  or 
when  some  parts  of  the  mass  were  obscured  by  overlapping  normal  breast  structures.  In 
addition,  further  refinement  was  necessary  before  detection  and  segmentation  of  spiculations. 


We  used  an  active  contour  model  for  the  first  stage  mass  shape  refinement,  and  spiculation 
detection  and  segmentation  for  the  final  shape  refinement. 

An  active  contour  is  a  deformable  continuous  curve,  whose  shape  is  controlled  by  internal  forces 
(the  model,  or  a-priori  knowledge  about  the  object  to  be  segmented)  and  external  forces  (the 
image).  The  internal  forces  impose  a  smoothness  constraint  on  the  contour,  and  the  external 
forces  push  the  contour  towards  salient  image  features,  such  as  edges.  To  solve  a  segmentation 
problem,  an  initial  boundary  is  iteratively  deformed  so  that  the  energy  due  to  internal  and 
external  forces  is  minimized  along  the  contour. 

The  internal  energy  components  in  our  active  contour  model  were  the  continuity  and  curvature  of 
the  contour,  as  well  as  the  homogeneity  of  the  segmented  object.  The  external  energy 
components  were  the  negative  of  the  smoothed  image  gradient  magnitude,  and  a  balloon  force 
that  exerted  pressure  at  a  normal  direction  to  the  contour.  The  contour  was  represented  by  the 
vertices of an N-point polygon whose vertices were v(i)=(x(i),y(i)), i=1,...,N. The energy to be
minimized was defined as

$$E = \sum_{i=1}^{N} \left[ w_{curv} E_{curv}(i) + w_{cont} E_{cont}(i) + w_{grad} E_{grad}(i) + w_{bal} E_{bal}(i) \right] + w_{hom} E_{hom}$$

where each energy term has a weight, w.

The curvature energy term is represented by an approximation to the second derivative of the
contour, Ecurv(i) = |v(i-1) - 2v(i) + v(i+1)|. This term is large when the angle at vertex i is
small. By discouraging small angles at vertices, this term attempts to smooth the contour. The
continuity term, wcontEcont(i), reflects the deviation of the length of the line segment under
consideration from the average line segment length d. This term favors contours with regular
spacing between the vertices over those with irregular spacing. The image gradient magnitude is
obtained by smoothing the image with a low-pass filter, finding the partial derivatives in the
horizontal and vertical directions, and then computing the magnitude of the partial derivative
vector. Since the gradient energy, Egrad(i), is defined as the negative of the gradient magnitude,
minimizing this term attracts the contour to object edges. The balloon energy encourages the
contour to expand in the normal direction, which is required to prevent the contour from
collapsing onto itself (Cohen 1991). The purpose of the homogeneity term, whomEhom(i), is to
make the object and the background regions as homogeneous as possible within each region, and
to maximize the difference between the two regions (Poon and Braun 1997).

To  minimize  the  contour  energy,  we  used  a  greedy  algorithm  that  was  first  proposed  by  Williams 
and  Shah  (Williams  and  Shah  1992).  In  this  algorithm,  the  contour  was  iteratively  optimized, 
starting  with  the  initial  contour  provided  by  clustering-based  segmentation.  At  each  iteration,  a 
neighborhood  of  each  vertex  was  examined,  and  the  vertex  was  moved  to  the  location  that 
minimized the contour energy. Figures 1(c)-(f) show the initial and final contours, respectively,
of  the  model  for  a  spiculated  and  a  nonspiculated  mass. 
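One pass of a greedy contour update can be sketched as follows. This is a simplified illustration and not the model used in the study: it keeps only the curvature, continuity, and gradient terms, omits the balloon and homogeneity terms, does not normalize the terms within each neighborhood, and uses a 3×3 neighborhood. Vertices are stored as integer (row, column) pairs, and the function name and default weights are hypothetical.

```python
import numpy as np

def greedy_snake_pass(vertices, grad_mag, w_curv=1.0, w_cont=1.0, w_grad=1.0):
    """Move each vertex to the 3x3 neighbor with the lowest local energy."""
    pts = np.array(vertices, dtype=int)
    n = len(pts)
    rows, cols = grad_mag.shape
    # The current position is tried first, so ties leave the vertex in place.
    offsets = [(0, 0)] + [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
    # Average spacing between consecutive vertices of the closed polygon.
    mean_d = np.linalg.norm(pts - np.roll(pts, 1, axis=0), axis=1).mean()
    n_moved = 0
    for i in range(n):
        prev_pt, next_pt = pts[i - 1], pts[(i + 1) % n]
        best, best_energy = pts[i].copy(), np.inf
        for dr, dc in offsets:
            cand = pts[i] + np.array([dr, dc])
            r, c = cand
            if not (0 <= r < rows and 0 <= c < cols):
                continue
            e_curv = np.linalg.norm(prev_pt - 2 * cand + next_pt)
            e_cont = abs(mean_d - np.linalg.norm(cand - prev_pt))
            e_grad = -grad_mag[r, c]  # strong edges lower the energy
            energy = w_curv * e_curv + w_cont * e_cont + w_grad * e_grad
            if energy < best_energy:
                best, best_energy = cand, energy
        if not np.array_equal(best, pts[i]):
            n_moved += 1
        pts[i] = best
    return pts, n_moved

# Toy gradient image with one strong edge pixel next to the third vertex.
grad = np.zeros((10, 10))
grad[5, 6] = 1.0
start = [(2, 2), (2, 7), (5, 5), (7, 2)]
new_pts, moved = greedy_snake_pass(start, grad, w_curv=0.0, w_cont=0.0)
print(new_pts.tolist(), moved)
```

In practice such a pass would be repeated until few or no vertices move, starting from the contour given by the clustering-based segmentation.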

2.3.  Segmentation  of  spiculations 

Spiculations  on  mammograms  appear  as  linear  structures  with  a  positive  image  contrast,  and  they 
usually  lie  in  a  radial  direction  to  the  mass.  As  a  result  of  their  linearity,  the  gradient  directions 
at  image  pixels  on  or  close  to  the  spiculation  are  more  or  less  in  the  same  orientation  relative  to 
that of the spiculation. In order to investigate whether a pixel (ic,jc) on the mass contour lies on
the path of a spiculation, one can make use of this property as follows: In a search region S of
the image, compute the statistics of the angular difference θ between the image gradient direction
at image pixel (i,j) and the direction of the vector joining pixels (ic,jc) and (i,j) (figure 2). If the
pixel (ic,jc) lies on the path of a spiculation, then θ will be close to π/2 whenever the image pixel
(i,j) is on the spiculation. Therefore, the distribution of θ, obtained from all image pixels (i,j)
within the search region S, will have a peak around π/2. If there is no spiculation, and if the gray
levels  in  S  are  randomly  distributed,  then  this  distribution  will  be  uniform.  Karssemeijer  et  al. 
have  made  use  of  a  similar  idea  for  detecting  spiculated  lesions  on  mammograms  (Karssemeijer 
and  te  Brake  1996),  but  not  for  the  detection  of  the  actual  spiculations.  In  our  method,  we 
combined  this  idea  with  the  fact  that  spiculations  generally  lie  in  a  radial  direction  to  the  mass. 
Therefore,  the  region  S  could  be  limited  so  that  other  gradients,  such  as  those  resulting  from  the 
mass  contour  itself,  can  be  excluded  from  the  distribution  of  gradients  in  S.  The  details  of  our 
spiculation  detection  method  are  described  in  the  literature  (Sahiner  et  al.  2000;  Chan  et  al. 
2000). The contours of a spiculated and a nonspiculated mass after spiculation detection are
shown in figures 1(g) and 1(h), respectively.
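The angular difference statistic described in this section can be illustrated with a short sketch. It computes, for every pixel of a search region, the difference between the gradient direction and the direction of the vector from the contour pixel, folded into the interval [0, π). The finite-difference gradient, the folding convention, the exclusion of zero-gradient pixels, and the synthetic ridge image are assumptions made for illustration; the details of the actual method are given in the cited literature.

```python
import numpy as np

def angular_differences(image, contour_pixel, search_mask):
    """Angles between gradient directions and vectors from the contour pixel."""
    g_row, g_col = np.gradient(np.asarray(image, dtype=np.float64))
    ic, jc = contour_pixel
    ii, jj = np.nonzero(search_mask)
    magnitude = np.hypot(g_row[ii, jj], g_col[ii, jj])
    # Skip pixels with no gradient and the contour pixel itself.
    keep = (magnitude > 0) & ((ii != ic) | (jj != jc))
    ii, jj = ii[keep], jj[keep]
    grad_dir = np.arctan2(g_row[ii, jj], g_col[ii, jj])
    join_dir = np.arctan2(ii - ic, jj - jc)
    return np.mod(grad_dir - join_dir, np.pi)

# Synthetic vertical ridge (a stand-in for a spiculation) along column 10.
cols = np.arange(21)
ridge = np.tile(np.exp(-0.5 * (cols - 10.0) ** 2), (30, 1))
mask = np.zeros(ridge.shape, dtype=bool)
mask[5:21, [9, 11]] = True  # search pixels on both flanks of the ridge
theta = angular_differences(ridge, (0, 10), mask)
hist, edges = np.histogram(theta, bins=9, range=(0.0, np.pi))
print(hist)
```

For this synthetic ridge the values of θ cluster around π/2, which is the peak that the method looks for; in a region without linear structure the histogram would be expected to be much flatter.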

3.  Feature  Extraction  and  Classification 

In  the  spiculation  segmentation  stage,  three  spiculation  measures  were  extracted  from  each  ROI. 
These  were  the  number  of  possible  spiculations  (NPS),  the  percentage  area  of  spiculations 
(PAS),  and  the  product  of  these  two  measures  (PR).  These  spiculation  measures  were  used  in 
addition  to  eleven  morphological  features  extracted  from  the  final  mass  outline  for  mass 
characterization.  The  first  five  morphological  features  were  based  on  the  normalized  radial 
length  (NRL),  defined  as  the  Euclidean  distance  from  the  object’s  centroid  to  each  of  its  edge 
pixels  and  normalized  relative  to  the  maximum  radial  length  for  the  object.  These  features 
included  NRL  mean,  standard  deviation,  entropy,  area  ratio,  and  zero  crossing  count  (Petrick  et 
al. 1999). The remaining six morphological features included the perimeter, area, perimeter-to-area
ratio, circularity, rectangularity, and contrast of the object. The definitions of these features
can  be  found  in  the  literature  (Petrick  et  al.  1999). 
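The normalized radial length and three of the features derived from it can be sketched as follows. The zero crossing count is taken here as the number of times the NRL crosses its own mean along the closed boundary, which is an assumed definition; the entropy and area ratio features are omitted, and the exact definitions used in the study are those of Petrick et al. 1999. The function name and the toy boundaries are hypothetical.

```python
import numpy as np

def nrl_features(boundary, centroid):
    """Mean, standard deviation, and mean-crossing count of the NRL."""
    boundary = np.asarray(boundary, dtype=np.float64)
    radial = np.linalg.norm(boundary - np.asarray(centroid, dtype=np.float64), axis=1)
    nrl = radial / radial.max()  # normalize by the maximum radial length
    above = nrl > nrl.mean()
    # Count the sign changes around the closed boundary.
    crossings = int(np.count_nonzero(above != np.roll(above, 1)))
    return float(nrl.mean()), float(nrl.std()), crossings

square = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # equal radial lengths
diamond = [(2, 0), (0, 1), (-2, 0), (0, -1)]    # alternating radial lengths
print(nrl_features(square, (0, 0)))
print(nrl_features(diamond, (0, 0)))
```

A boundary with equal radial lengths gives a standard deviation of zero and no crossings, while an irregular boundary gives larger values of both.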

Stepwise  feature  selection  was  used  to  select  effective  features  for  classification  from  the  feature 
space  of  fourteen  features.  Four  features,  namely,  NPS,  PR,  contrast,  and  circularity  were 
selected  using  the  set  of  training  ROIs.  A  backpropagation  neural  network  (BPN)  with  four  input 
nodes,  two  hidden-layer  nodes,  and  a  single  output  node  was  trained  using  the  training  set.  The 
accuracy  of  the  designed  classifier  was  evaluated  by  applying  the  classifier  to  test  cases  that  had 
not  been  used  for  training.  The  test  scores  were  analyzed  using  receiver  operating  characteristic 
(ROC)  methodology.  The  classification  accuracy  was  evaluated  as  the  area  Az  under  the  ROC 
curve. 

4.  Data  Set 

The  mammograms  used  in  this  study  were  randomly  selected  from  the  files  of  patients  in  the 
Radiology  Department  at  the  University  of  Michigan  who  had  undergone  biopsy.  The  criterion 
for  inclusion  of  a  mammogram  in  the  data  set  was  that  the  mammogram  contained  a  biopsy- 
proven  mass,  and  that  approximately  equal  numbers  of  malignant  and  benign  masses  were 
present  in  the  data  set.  Our  training  data  set  consisted  of  243  mammograms  (116  benign  and  127 
malignant)  from  101  patients.  Our  test  data  set  consisted  of  95  mammograms  (42  benign  and  53 
malignant)  from  45  patients.  A  single  view  was  available  for  nine  of  these  45  patients.  For  the 
remaining  36  test  patients,  two  or  more  views  were  available.  The  true  pathology  of  all  the 
masses  was  determined  by  biopsy  and  histologic  analysis. 

5.  Results 

We investigated film-based classification of the masses on each mammogram, as well as case-based
classification by combining possible multiple views of the same mass. For case-based
classification, the BPN scores from different views were averaged. The training Az values for
film-based and case-based classification were 0.91 and 0.95, respectively. The test Az values for
film-based and case-based classification were 0.81 and 0.87, respectively. The training and test ROC curves
are  shown  in  figures  3(a)  and  3(b),  respectively. 
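The case-based scoring can be sketched as a simple average of the film-based scores of each case. The case identifiers and score values below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

def case_based_scores(film_scores):
    """Average the classifier scores of all views that belong to the same case."""
    grouped = defaultdict(list)
    for case_id, score in film_scores:
        grouped[case_id].append(score)
    return {case_id: sum(scores) / len(scores)
            for case_id, scores in grouped.items()}

# Case "A" has two views, case "B" has a single view.
films = [("A", 0.2), ("A", 0.6), ("B", 0.9)]
cases = case_based_scores(films)
print(cases)
```

A case with a single view keeps its film-based score, which matches the situation of the nine test patients for whom only one view was available.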

6.  Discussion  and  Conclusion 

In our previous work, the clustering method was successful in segmenting the main portion of
the mass from the background. However, a major limitation of clustering-based segmentation is
that, even for well-circumscribed masses, the segmented shape contains many irregularities due
to structured or random noise (see figure 1(d)). Another limitation is that, when parts of the
mass are obscured by overlapping normal breast structures, the clustering method yields inaccurate
results. In this study, we used an active contour model to refine the clustering-based
segmentation results. By choosing a balance between the active contour weights based on the
training set, we were able to obtain object shapes that were mostly smooth, while contours with
sharp turns were still possible where the object boundary contained large gradients. Compared to
clustering, the resulting boundaries were subjectively judged to be closer to the actual mass
boundaries. However, the active contour model was not suitable for the segmentation of
spiculations. Since spiculations do not have a large gradient magnitude, the contour cannot
have sharp turns at spiculation locations unless the curvature weight w_curv is very small. A small
value for w_curv is not practical, however, because it results in mass shapes that are too irregular
all around the contour. For this reason, we designed an additional stage for detection and
segmentation of spiculations.

Our  results  indicate  that  accurate  segmentation  of  mammographic  masses,  detection  of 
spiculations,  and  the  use  of  morphological  and  spiculation  features  can  be  effective  in  classifying 
breast  masses  as  malignant  or  benign. 

Figure 1. (a), (b) The mass ROI, (c), (d) clustering-based segmentation, (e), (f) active-contour
based segmentation, and (g), (h) the result of spiculation detection and segmentation for a
spiculated mass (a, c, e, and g) and a nonspiculated mass (b, d, f, and h).

Figure 2. The definition of the angular difference 6.

Figure 3. ROC curves for film-based and case-based classification, (a) Training (b) Test.

Abstract: A new type of classifier combining an unsupervised
and a supervised model was designed and applied to classification
of malignant and benign masses on mammograms. The
unsupervised model was based on an adaptive resonance theory
(ART2) network which clustered the masses into a number of
separate classes. The classes were divided into two types: one
containing only malignant masses and the other containing a mix
of malignant and benign masses. The masses from the malignant
classes were classified by ART2. The masses from the mixed
classes were input to a supervised linear discriminant classifier
(LDA). In this way, some malignant masses were separated
and classified by ART2 and the less distinguishable benign and
malignant masses were classified by LDA. For the evaluation of
classifier performance, 348 regions of interest (ROI’s) containing
biopsy-proven masses (169 benign and 179 malignant) were used.
Ten different partitions of training and test groups were randomly
generated using an average of 73% of ROI’s for training and
27% for testing. Classifier design, including feature selection and
weight optimization, was performed with the training group.
The test group was kept independent of the training group. The
performance of the hybrid classifier was compared to that of
an LDA classifier alone and a backpropagation neural network
(BPN). Receiver operating characteristic (ROC) analysis was
used to evaluate the accuracy of the classifiers. The average area
under the ROC curve (Az) for the hybrid classifier was 0.81 as
compared to 0.78 for the LDA and 0.80 for the BPN. The partial
areas above a true positive fraction of 0.9 were 0.34, 0.27 and
0.31 for the hybrid, the LDA and the BPN classifier, respectively.
These results indicate that the hybrid classifier is a promising
approach for improving the accuracy of classification in CAD
applications.

Index  Terms —  Computer-aided  diagnosis,  hybrid  classifier, 
mammography,  neural  networks. 

I.  Introduction 

Mammography is the most effective method for detection of early breast cancer [1]. However, the
specificity for classification of malignant and benign lesions from mammographic images is
relatively low. Clinical studies have shown that the positive predictive value (i.e., ratio of the
number of breast cancers found to the total number of biopsies) is only 15% to 30% [2]-[4]. It is
important to increase the positive predictive value without reducing the sensitivity of
breast cancer detection. Computer-aided diagnosis (CAD) has the potential to increase the
diagnostic accuracy by reducing the false-negative rate while increasing the positive predictive
values of mammographic abnormalities.

Classifier design is an important step in the development of a CAD system. A classifier has to be
able to merge the available input feature information and make a correct evaluation. Commonly
used classifiers for CAD include linear discriminants (LDA) [5], [6] and backpropagation neural
networks (BPN) [7]-[9], which have been shown to perform well in lesion classification problems
[10]-[22]. These classifiers are generally designed by supervised training. However, they have
limitations in dealing with nonlinearities in the data (in the case of LDA) and in
generalizability when only a limited number of training samples is available (especially BPN).
Another classification approach is based on unsupervised classifiers, which cluster the data into
different classes based on the similarities in the properties of the input feature vectors.
Unsupervised classifiers can therefore be used to analyze the similarities within the data.
However, it is difficult to use them as discriminatory classifiers [29], [30]. They also have
limited generalizability when the training sample set is small.

We propose here a hybrid unsupervised/supervised structure to improve classification performance.
The design of this structure was inspired by neural information processing principles such as
self-organization, decentralization, and generalization. It combines the adaptive resonance theory
network (ART2) [26], [27] and the LDA classifier as a cascade system (ART2LDA). The
self-organizing unsupervised ART2 network automatically decomposes the input samples into classes
with different properties. The ART2 network has been found to perform better than conventional
clustering techniques in terms of learning speed and discriminatory resolution for the
detection of rare events in many classification tasks [28]-[30]. The supervised LDA then
classifies the samples belonging to a subset of classes that have greater similarities. By
improving the homogeneity of the samples, the classifier designed for the subset of classes may
be more robust.

The ART2LDA design implements both structural and data decomposition. Decomposition is a powerful
approach that can reduce the complexity of a problem. Both structural decomposition and data
decomposition can improve classification accuracy [23] as well as model accuracy [24]. However,
decomposition can also reduce the prediction accuracy due to overfitting the training data. We
will demonstrate in this paper that the proposed hybrid structure can reduce the overfitting
problem and improve the prediction capabilities of the system. The performance of the hybrid
ART2LDA classifier will be compared with those of an LDA alone and a BPN classifier.

The rest of the paper is organized as follows. In Section II, the ART2 unsupervised network is
described. A hybrid ART2LDA classifier is introduced in Section III. Section IV describes the
data set used in this study. The results are presented in Section V. Section VI contains
discussion of these results. Finally, Section VII concludes this investigation.

II.  ART2  Unsupervised  Neural  Network 

The ART2 is a self-organizing system that can simulate human pattern recognition. ART2 was first
described by Grossberg [25], and a series of further improvements were carried out by Carpenter,
Grossberg, and coworkers [26]-[28]. The ART2 network clusters the data into different classes
based on the properties of the input feature vectors. The members within a class have similar
properties. The process of ART2 network learning must balance plasticity against stability (the
plasticity-stability dilemma). Plasticity is the ability of the system to discover
and remember important new feature patterns. Stability is the ability of the system to remain
unchanged when already known feature patterns with noise are input to the system. The
balance between plasticity and stability in the ART2 training algorithm allows fast learning
[28], i.e., rare events can be memorized with a small number of training iterations without
forgetting previous events. The more conventional training algorithms, such as backpropagation
[7]-[9], perform slow learning, i.e., they tend to average over occurrences of similar
events and require many training iterations.

The structure of the ART2 system is shown in Fig. 1. It consists of two parts: the ART2 network
and the learning stage. Suppose that there are n input features $x_i$ ($i = 1, \ldots, n$) and k
classes in the ART2 network. When a new vector is presented to the input of the ART2 network, an
activation value $p_j$ for class j is calculated as

$$p_j = \sum_{i=1}^{n} x_i w_{ij}, \qquad j = 1, \ldots, k \qquad (1)$$

where $w_{ij}$ is the connection weight between input i and class j. The activation value is a
measure of the membership of the particular input feature vector to class j. The higher the value
$p_j$ is, the better the input vector matches class j. The maximum value $p_r$ is selected from
all $p_j$ ($j = 1, \ldots, k$) to find the best class match. Furthermore, in order to balance the
contribution to the activation value from all feature components, the input feature values
applied to the ART2 system are scaled between zero and one [30]. This normalization allows
detection of similar feature patterns even when the magnitudes of the input feature components
are very different.
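
As a small illustration of the two steps just described, the sketch below scales each feature to the range zero to one and evaluates the activation values of (1). The function names are hypothetical, and min-max scaling over the data set is an assumption, since the text states only that the features are scaled between zero and one.

```python
def scale_features(rows):
    """Min-max scale every feature (column) to [0, 1] (assumed scaling rule)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column would give a zero span; use 1.0 so it maps to 0.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]

def activations(x, W):
    """Eq. (1): p_j = sum_i x_i * w_ij, with W[i][j] linking input i to class j."""
    k = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(k)]

def best_class(x, W):
    """Return (r, p_r): the index and value of the largest activation."""
    p = activations(x, W)
    r = max(range(len(p)), key=p.__getitem__)
    return r, p[r]

scaled = scale_features([[0, 10], [5, 20], [10, 30]])
winner = best_class([0.2, 0.9], [[1, 0], [0, 1]])
```

In this toy example the second class wins because the input is dominated by its second feature.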

[Fig. 1. Structure of the ART2 system; the inputs are the features x1, x2, ..., xn.]

The learning stage of the ART2 system can modify either the weights of the selected class or the
structure of the complete ART2 network, by adding a new class. An additional parameter, the
vigilance, is used to determine the type of learning [26]. The vigilance parameter $p_{vig}$ is a
threshold value that is compared to the maximum activation value $p_r$. If $p_r$ is larger than
$p_{vig}$, then the input vector is considered to belong to class r. The adaptation of the
weights connected with class r is performed as follows:

$$w_{ir}^{new} = w_{ir}^{old} + \eta \left( x_i - w_{ir}^{old} \right), \qquad i = 1, \ldots, n \qquad (2)$$

where $\eta$ is a learning rate. The adaptation of the class r weights in (2) aims at maximizing
the $p_r$ value for the particular input vector. In an iterative manner, the weights are adjusted
so that the activation values produced for similar input vectors will be maximum only for the
class to which they belong and these maximum activation values will be higher than $p_{vig}$.

If the maximum activation value $p_r$ is smaller than $p_{vig}$, it is an indication that a
novelty has appeared, and a new class will be added to the ART2 structure. The new weights
connecting the input with the new class (k + 1) are initialized with the scaled input feature
values of this novelty. In this way, the activation value $p_{k+1}$ will be the maximum
($p_r = p_{k+1}$) and higher than $p_{vig}$ when computed for this novelty in further training
iterations. The value of the vigilance parameter $p_{vig}$ determines the resolution of ART2. It
can be chosen in the range between zero and one. If $p_{vig}$ is relatively small, only
very different input feature vectors will be distinguished and separated into different classes.
If $p_{vig}$ is relatively large, input feature vectors that are more similar will also be
separated into different classes. The value of $p_{vig}$ is selected differently
depending on the particular application.
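
The learning stage can be sketched as a simplified ART2-style loop. This is not the full ART2 network of [26]-[28]; it implements only the activation of (1), the vigilance test, the weight update of (2), and the creation of a new class for a novelty. One assumption is added and should be read as such: each input is normalized to unit length, so that a class seeded with a novelty responds to that same input with an activation of one. All names (`art2_like_fit`, `art2_like_assign`, `unit`, `dot`) and the default learning rate are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length (assumption added for this sketch)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def art2_like_fit(samples, vigilance, eta=0.5, epochs=3):
    """Cluster the samples; returns one weight vector per discovered class."""
    classes = []
    for _ in range(epochs):
        for raw in samples:
            x = unit(raw)
            acts = [dot(x, w) for w in classes]          # Eq. (1)
            if acts and max(acts) > vigilance:
                r = acts.index(max(acts))
                # Eq. (2): move the winning class weights toward the input.
                classes[r] = [w + eta * (xi - w) for xi, w in zip(x, classes[r])]
            else:
                # Novelty: seed a new class with the input itself.
                classes.append(x)
    return classes

def art2_like_assign(raw, classes):
    """Index of the class with the largest activation for one input."""
    x = unit(raw)
    acts = [dot(x, w) for w in classes]
    return acts.index(max(acts))

samples = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]]
coarse = art2_like_fit(samples, vigilance=0.9)    # low vigilance: few classes
fine = art2_like_fit(samples, vigilance=0.999)    # high vigilance: many classes
```

With the lower vigilance the four toy samples fall into two classes, whereas the higher vigilance separates every sample, which mirrors the role of the vigilance parameter as a resolution control.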

III. ART2LDA Classifier

Despite the good performance of ART2 for efficient clustering and detection of novelties, the
fast learning approach can cause problems associated with the generalization capability of the
system and the correct classification of unknown cases. Supervised classifiers such as linear
discriminants or backpropagation neural network classifiers can have better generalization
capability than ART2, because they are trained by averaging over similar event occurrences.
However, the learning process in these traditional learning algorithms tends to erase the memory
of previous expert knowledge when a new type of expertise is being learned. Therefore, these
classifiers do not have as good an ability to correctly classify rare events as ART2 [28], [29].

In  order  to  improve  the  accuracy  and  generalization  of  a 
classifier, we propose to design a hybrid classifier that combines
the unsupervised ART2 network and a supervised LDA
classifier.  This  hybrid  classifier  (ART2LDA)  utilizes  the  good 
resolution  capability  of  ART2  and  the  good  generalization 
capability  of  LDA.  The  ART2  first  analyzes  the  similarity  of 
the  sample  population  and  identifies  a  subpopulation  that  may 
be  separated  from  the  main  population.  This  will  improve  the 
performance  of  the  second-stage  LDA  if  the  subpopulation 
causes  the  sample  population  to  deviate  from  multivariate 
normal  distributions  for  which  LDA  is  an  optimal  classifier. 
Therefore,  the  ART2  serves  as  a  screening  tool  to  improve 
the  homogeneity  of  the  sample  distributions  by  classifying 
outlying  samples  into  separate  classes. 

The  ART2LDA  hybrid  classifier  can  be  described  as 

$$y_{AL} = g(f_2(x)) \, f_1(x) + \left( 1 - g(f_2(x)) \right) \qquad (3)$$

where x is the input vector, $f_1(\cdot)$ is the LDA classifier, $f_2(\cdot)$ is the ART2
classifier, and $g(\cdot)$ is a binary membership function, which labels the classes identified
by ART2 to be one of two types: malignant class or mixed class. A particular class is defined as
malignant if it contains only malignant members. It is defined as mixed if it contains both
malignant and benign members. The membership function is defined as follows:

$$g(c) = \begin{cases} 0, & \text{if } c \text{ is a malignant class} \\ 1, & \text{if } c \text{ is a mixed class.} \end{cases}$$

The  type  of  a  given  class  is  determined  based  on  ART2 
classification  of  the  training  data  set. 

The  structure  of  the  ART2LDA  classifier  is  shown  in  Fig.  2. 
The  ART2  classifies  the  input  sample  x  into  either  a  malignant 
or  a  mixed  class.  Depending  on  the  class  type  the  function 
g(-)  determines  whether  the  LDA  classifier  will  be  used. 
If  x  is  classified  into  a  mixed  class,  the  final  classification 
will  be  obtained  based  on  the  LDA  classifier.  However,  if 
x  is  classified  by  ART2  into  a  malignant  class,  then  the 
mass  will  be  considered  malignant,  without  using  the  LDA 
classifier.  Therefore,  in  the  ART2LDA  structure,  the  ART2 
is  used  both  as  a  classifier  and  a  supervisor.  This  can  be 
seen in (3). The first term in (3), $g(f_2(x)) f_1(x)$, is the LDA classifier multiplied by the
ART2 control part $g(f_2(x))$. The second term in (3), $(1 - g(f_2(x)))$, gives the
classification result of the ART2 stage. If $f_2(x)$ is a malignant class, then
$g(f_2(x)) = 0$, the LDA stage is eliminated, and the classifier output $y_{AL}$ is equal to 1.
On the other hand, if $f_2(x)$ is a mixed class, then $g(f_2(x)) = 1$, the ART2 term is
eliminated, and the final classification is determined by the LDA classifier
($y_{AL} = f_1(x)$).
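
The gating in (3) can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical, labels are coded as 1 for malignant and 0 for benign, and two cases that the text does not address are handled by assumption, namely a class with only benign training members and a class never seen in training, both of which are routed to the LDA stage.

```python
MALIGNANT, BENIGN = 1, 0

def label_art2_classes(train_classes, train_labels):
    """Build g(c) from the training set: 0 for a malignant class, 1 otherwise.

    train_classes[i] is the ART2 class of training sample i and
    train_labels[i] its biopsy-proven label.
    """
    members = {}
    for c, y in zip(train_classes, train_labels):
        members.setdefault(c, set()).add(y)
    # Assumption: a purely benign class is treated like a mixed class.
    return {c: 0 if ys == {MALIGNANT} else 1 for c, ys in members.items()}

def art2lda_output(x, f1, f2, g):
    """Eq. (3): y_AL = g(f2(x)) * f1(x) + (1 - g(f2(x)))."""
    gate = g.get(f2(x), 1)   # assumption: unseen classes fall back to the LDA
    return gate * f1(x) + (1 - gate)

# Class 0 holds only malignant samples; classes 1 and 2 go to the LDA stage.
g = label_art2_classes([0, 0, 1, 1, 2], [1, 1, 1, 0, 0])
```

Here `f1` stands for any callable returning an LDA score and `f2` for any callable returning an ART2 class index.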

IV.  Methods 

A.  Data  Set 

The mammograms used in this study were randomly selected from the files of patients who had
undergone biopsies at the University of Michigan. The criterion for inclusion
of a mammogram in the data set was that the mammogram contained a biopsy-proven mass. The data
set contained 348 mammograms with a mixture of benign (n = 169) and malignant (n = 179) masses.
On each mammogram, a region of interest (ROI) containing the mass was identified by a
radiologist experienced in breast imaging. The visibility of the masses was rated by the
radiologist on a scale of 1 to 10, where the rating of 1 corresponds to the most visible category.
The distributions of the visibility rating for both the malignant and benign masses are shown in
Fig. 3. The visibility ranged from subtle to obvious for both types of masses. It can be
observed that the benign masses tend to be more obvious than the malignant ones. Additionally,
the likelihood of malignancy for each mass was estimated based on its mammographic
appearance. The radiologist rated the likelihood of malignancy on a scale of 1 to 10, where 1
indicated a mass with the most benign appearance. The distribution of the malignancy rating
of the masses is shown in Fig. 4.

Fig. 2. Structure of the ART2LDA classifier.

The  data  set  can  be  considered  as  representative  of  the 
patient  population  that  is  sent  for  biopsy  under  current  clinical 
criteria.  Some  characteristics  of  many  malignant  and  benign 
masses  can  be  visually  distinguished  by  radiologists.  However, 
there  is  also  a  nonnegligible  fraction  of  malignant  masses  that 
are  very  similar  to  benign  masses  (the  low  malignancy  rating 
region  in  Fig.  4).  The  estimated  likelihood  of  malignancy  of 
malignant  and  benign  masses  that  are  sent  for  biopsy  basically 
overlaps  over  the  entire  range.  This  is  consistent  with  the  fact 
that  in  order  not  to  miss  malignant  masses  radiologists  must 
recommend  biopsy  for  even  very  low  suspicion  lesions. 

Fig. 3. The distribution of the visibility ranking of the masses in the dataset.
The ranking was performed by an experienced breast radiologist (1: very
obvious, 10: very subtle).

Fig. 4. The distribution of the malignancy ranking of the masses in the
dataset. The ranking was performed by an experienced breast radiologist (1:
very likely benign, 10: very likely malignant).

Three hundred and five of the mammograms were digitized with a LUMISYS DIS-1000 laser scanner at
a pixel resolution of 100 µm × 100 µm and 4096 gray levels. The digitizer was calibrated so that
gray level values were linearly and inversely proportional to the optical density (OD) within the
range of 0.1 to 2.8 OD units, with a slope of -0.001 OD/pixel value. Outside this range, the
slope of the calibration curve decreased gradually. The OD range of the digitizer was 0 to 3.5.
The remaining 43 mammograms were digitized with a LUMISCAN 85 laser scanner at a pixel resolution
of 50 µm × 50 µm and 4096 gray levels. The digitizer was calibrated so that gray level values were
linearly and inversely proportional to the OD within the range of 0 to 4 OD units,
with a slope of -0.001 OD/pixel value. In order to process the mammograms digitized with these
two different digitizers, the images digitized with the LUMISCAN 85 digitizer were averaged
with a 2 × 2 box filter and subsampled by a factor of two, resulting in 100 µm images.
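
The resolution matching step can be sketched as follows. The sketch assumes that the box filter and the subsampling grid are aligned, so that the combined operation reduces to averaging non-overlapping 2 × 2 blocks; the function name is hypothetical and the image is a plain list of rows.

```python
def box_downsample(image):
    """Average non-overlapping 2x2 blocks (2x2 box filter, then subsample by 2).

    A trailing odd row or column is dropped (assumption for this sketch).
    """
    rows = len(image) // 2 * 2
    cols = len(image[0]) // 2 * 2
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# A 2x4 image at 50 um per pixel becomes a 1x2 image at 100 um per pixel.
small = box_downsample([[1, 3, 5, 7], [5, 7, 9, 11]])
```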

In order to validate the prediction abilities of the classifier, the data set was partitioned
randomly into training and test subsets on a 3:1 ratio, under the constraints that both the
malignant and the benign samples were split with the 3:1 ratio and that the images from the same
patient were grouped into the same (training or test) subset. These constraints caused
the subsets to deviate from an exact 3:1 ratio. The data set was repartitioned randomly ten
times. On average, 73% of the samples were grouped into the training set and 27% into the
test set. The training and test results from the ten partitions
were averaged to reduce their variability.
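
One way such a constrained partition could be generated is sketched below. The procedure actually used is not described in detail, so everything here is an assumption made for illustration: patients are stratified by whether any of their images shows a malignant mass, each stratum is shuffled and cut at the requested fraction, and all images of a patient follow that patient. The name `patient_split` and the (patient_id, label) layout are hypothetical.

```python
import random
from collections import defaultdict

def patient_split(records, train_fraction=0.75, seed=0):
    """Split image indices into (train, test) without splitting any patient.

    records: list of (patient_id, label) pairs, one per image, label 1 for
    malignant and 0 for benign.
    """
    by_patient = defaultdict(list)
    for idx, (pid, _) in enumerate(records):
        by_patient[pid].append(idx)
    # Stratify patients: malignant if any of their images is malignant.
    strata = defaultdict(list)
    for pid, idxs in by_patient.items():
        strata[any(records[i][1] == 1 for i in idxs)].append(pid)
    rng = random.Random(seed)
    train, test = [], []
    for pids in strata.values():
        pids.sort()              # fixed order before the seeded shuffle
        rng.shuffle(pids)
        cut = round(train_fraction * len(pids))
        for pid in pids[:cut]:
            train.extend(by_patient[pid])
        for pid in pids[cut:]:
            test.extend(by_patient[pid])
    return sorted(train), sorted(test)

# Twenty hypothetical patients with two images each.
records = [(p, p % 2) for p in range(20) for _ in range(2)]
train_idx, test_idx = patient_split(records, seed=1)
```

Calling the function with ten different seeds would give ten partitions. Because whole patients are moved, the image-level ratio only approximates 3:1, which is consistent with the deviation noted above.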


B.  Feature  Extraction 

A  rectangular  ROI  was  defined  to  include  the  radiologist- 
identified  mass  with  an  additional  surrounding  breast  tissue 
region  of  at  least  40  pixels  wide  from  any  point  of  the  mass 
border. A fully automated method was then used for segmentation
of the mass from the breast tissue background within
the  ROI.  The  rubber  band  straightening  transform  (RBST)  was 
previously  developed  [12]  to  map  a  band  of  pixels  surrounding 
the  mass  onto  the  Cartesian  plane  (a  rectangular  region).  In  the 
transformed  image,  the  border  of  mass  appears  approximately 
as  a  horizontal  edge  and  spiculations  appear  approximately 
as  vertical  lines.  The  transformation  of  the  radially  oriented 
textures  surrounding  the  mass  margin  to  a  more  uniform 
orientation  facilitates  the  extraction  of  texture  features. 

The texture features used in this study were calculated from spatial gray-level dependence (SGLD)
matrices [10]-[12], [31], and run-length statistics (RLS) matrices [32] computed from the RBST
images. The (i, j)th element of the SGLD matrix is the joint probability that gray levels i and j
occur at a distance of d pixels apart in a direction θ in an image. Based on our previous studies
[10], a bit depth of eight was used in the SGLD matrix construction, i.e., the four least
significant bits of the 12-bit pixel values were discarded. Thirteen texture measures were used:
correlation, energy, difference entropy, inverse difference moment, entropy, sum average, sum
entropy, inertia, sum variance, difference average, difference variance, and two types of
information measure of correlation. These measures were extracted from each SGLD matrix at
ten different pixel pair distances (d = 1, 2, 3, 4, 6, 8, 10, 12, 16, and 20) and in four
directions (0°, 45°, 90°, and 135°). Therefore, a total of 520 SGLD features (13 measures × 10
distances × 4 directions) were calculated for each image. The definitions of the texture measures
are given in the literature [10]-[12], [31]. These features contain information about image
characteristics such as homogeneity, contrast, and the complexity of the image.
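
To make the SGLD construction concrete, the sketch below builds a normalized co-occurrence matrix for one distance and direction and evaluates two of the thirteen measures, energy and inertia. It is a minimal illustration under stated assumptions: pairs are counted symmetrically (both (i, j) and (j, i)), the direction offsets follow image row and column indices, and the helper names are hypothetical. The exact conventions of [10]-[12], [31] may differ.

```python
# Row/column offsets for a unit step in each direction (assumed convention).
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def to_8bit(value_12bit):
    """Discard the four least significant bits of a 12-bit pixel value."""
    return value_12bit >> 4

def sgld_matrix(image, d, angle, levels):
    """Normalized co-occurrence matrix for distance d and the given angle."""
    dr, dc = OFFSETS[angle][0] * d, OFFSETS[angle][1] * d
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1      # symmetric counting (assumption)
    total = sum(map(sum, counts)) or 1
    return [[v / total for v in row] for row in counts]

def energy(p):
    return sum(v * v for row in p for v in row)

def inertia(p):
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))

image = [[0, 0], [1, 1]]               # toy two-level image
p_horizontal = sgld_matrix(image, d=1, angle=0, levels=2)
p_vertical = sgld_matrix(image, d=1, angle=90, levels=2)
n_sgld_features = 13 * 10 * 4          # measures x distances x directions
```

For this toy image the horizontal neighbors are always equal, so the inertia at 0° is zero, while every vertical pair differs by one gray level.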

RLS texture features were extracted from the vertical and horizontal gradient magnitude images,
which were obtained by filtering the RBST image with horizontally or vertically
oriented Sobel filters and computing the absolute gradient value of the filtered image. A gray
level run is a set of consecutive, collinear pixels in a given direction which have
the same gray level value. The run length is the number of pixels in a run [32]. The RLS matrix
describes the run length statistics for each gray level in the image. The (i, j)th element
of the RLS matrix is the number of times that the gray level i in the image possesses a run
length of j in a given direction. In our previous study, it was found experimentally that a bit
depth of five in the RLS matrix computation could provide good texture characteristics [12].

Five texture measures, namely, short run emphasis, long run emphasis, gray level nonuniformity,
run length nonuniformity, and run percentage, were extracted from the vertical and
horizontal gradient images in two directions, θ = 0° and θ = 90°. Therefore, a total of 20 RLS
features (5 measures × 2 gradient images × 2 directions) were calculated for
each ROI. The formal definition of the RLS feature measures
can be found in [32].
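
The run-length statistics can be illustrated for the horizontal direction. The sketch stores the RLS matrix sparsely, keyed by (gray level, run length), and computes three of the five measures using the commonly used run-length definitions, which are assumed here to agree with those in [32]. The function names are hypothetical.

```python
from collections import Counter

def run_length_matrix(image):
    """Count horizontal runs; keys are (gray_level, run_length)."""
    runs = Counter()
    for row in image:
        start = 0
        for c in range(1, len(row) + 1):
            # Close the current run at the end of the row or at a level change.
            if c == len(row) or row[c] != row[start]:
                runs[(row[start], c - start)] += 1
                start = c
    return runs

def rls_features(runs, n_pixels):
    """Short run emphasis, long run emphasis, and run percentage."""
    n_runs = sum(runs.values())
    sre = sum(cnt / length ** 2 for (_, length), cnt in runs.items()) / n_runs
    lre = sum(cnt * length ** 2 for (_, length), cnt in runs.items()) / n_runs
    return {
        "short_run_emphasis": sre,
        "long_run_emphasis": lre,
        "run_percentage": n_runs / n_pixels,
    }

image = [[0, 0, 1], [1, 1, 1]]
runs = run_length_matrix(image)        # runs of length 2, 1, and 3
features = rls_features(runs, n_pixels=6)
```

Runs in the vertical direction can be obtained by applying the same function to the transposed image.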

A  total  of  540  features  (520  SGLD  and  20  RLS)  were 
therefore  extracted  from  each  ROI. 

C.  Feature  Selection 

In order to reduce the number of features and to obtain the best feature set for designing a good
classifier, feature selection with stepwise linear discriminant analysis [33] was applied.
At each step of the stepwise selection procedure, one feature is entered into or removed from the
feature pool by analyzing its effect on the selection criterion. In this study, the Wilks’
lambda (the ratio of within-group sum of squares to the total sum of squares [34]) was used as
the selection criterion. The optimization procedure used a threshold Fin for feature entry
and a threshold Fout for feature removal. On a feature entry step, the features not yet selected
are entered into the selected feature pool one at a time, and the significance of the change in
the Wilks’ lambda caused by each feature is estimated based on F statistics. The feature with the
highest significance is entered into the feature pool if its significance is higher than Fin. On
a feature removal step, the features which have already been selected are analyzed one at a time
and the significance of the change in the Wilks’ lambda caused by removing each feature
is estimated. The feature with the least significance is removed
from the selected feature pool if its significance is less than
Fout. Since the appropriate values of Fin and Fout are not known a priori, we examined a range of
Fin and Fout values and chose the thresholds in such a way that a
minimum number of features were selected to achieve a high
accuracy of classification by LDA for the training sets. More
details about the stepwise linear discriminant analysis and its
application to CAD can be found in [10]-[12].
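
The entry test can be illustrated for the very first step of the procedure, when the feature pool is still empty. In that special case the Wilks’ lambda of a candidate feature is its univariate ratio of within-group to total sum of squares, and the F statistic reduces to the one-way analysis-of-variance form used below. The sketch does not cover later steps, which require the multivariate lambda given the features already selected, and it does not implement removal. The function names are hypothetical.

```python
def wilks_lambda(groups):
    """Within-group sum of squares over total sum of squares, one feature."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    total = sum((v - grand) ** 2 for v in values)
    within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        within += sum((v - mean) ** 2 for v in g)
    return within / total if total > 0 else 1.0

def f_to_enter(groups):
    """F statistic for entering a feature into an empty pool."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    lam = wilks_lambda(groups)
    if lam == 0:
        return float("inf")
    return (n - k) / (k - 1) * (1 - lam) / lam

def first_entry(features, f_in):
    """Name of the feature with the largest F if it exceeds f_in, else None.

    features maps a feature name to its values split by group, for example
    [benign_values, malignant_values].
    """
    scored = {name: f_to_enter(groups) for name, groups in features.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] > f_in else None

candidates = {
    "separating": [[1, 2, 3], [7, 8, 9]],   # groups differ clearly
    "useless": [[1, 5, 9], [1, 5, 9]],      # identical groups
}
chosen = first_entry(candidates, f_in=2.4)
```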

D.  Performance  Analysis 

To evaluate the classifier performance, the training and test discriminant scores were analyzed
using receiver operating characteristic (ROC) methodology [35]. The discriminant
scores of the malignant and benign masses were used as decision variables in the LABROC1 program
[36], which fit a binormal ROC curve based on maximum likelihood
estimation. The classification accuracy was evaluated as the area under the ROC curve, Az. For
the ART2LDA classifier, the discriminant scores of all case samples classified in the two
stages are combined. All masses classified into the malignant group by the ART2 stage were
assigned a constant positive discriminant score higher than or equal to the most malignant
discriminant score obtained from the LDA stage.
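
The score combination can be sketched as follows. Note that the area is computed here with the empirical (Mann-Whitney) estimate rather than the binormal maximum-likelihood fit of LABROC1, so the numbers it produces are not the Az values reported in this paper. The function names and the (score, label) layout are hypothetical; labels are 1 for malignant and 0 for benign.

```python
def combine_stage_scores(lda_scored, art2_malignant_labels):
    """Merge both stages into one list of (score, label) pairs.

    lda_scored: (score, label) pairs for the cases passed to the LDA stage.
    art2_malignant_labels: true labels of the cases that ART2 placed in a
    malignant class; each receives the highest LDA score.
    """
    top = max(score for score, _ in lda_scored)
    return list(lda_scored) + [(top, label) for label in art2_malignant_labels]

def empirical_auc(scored):
    """Fraction of malignant/benign pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

lda_stage = [(0.2, 0), (0.4, 1), (0.6, 0), (0.9, 1)]
combined = combine_stage_scores(lda_stage, art2_malignant_labels=[1, 1])
auc_lda_stage = empirical_auc(lda_stage)
auc_combined = empirical_auc(combined)
```

In this toy example the two malignant cases isolated by the ART2 stage raise the empirical area from 0.75 to 0.875.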

TABLE I
Number of Selected Features for the Ten Data Groups
with the Corresponding Fin and Fout Parameters

Data Group No.   Number of selected features   Fin   Fout
 1               12                            1.8   1.6
 2               15                            2.4   2.2
 3               13                            2.4   2.2
 4               18                            2.4   2.2
 5               14                            2.4   2.2
 6               14                            2.1   1.8
 7               13                            2.4   2.2
 8               18                            1.8   1.6
 9               14                            2.4   2.2
10               14                            2.4   2.2

The performance of ART2LDA was also assessed by estimation of the partial area index
($A_z^{(0.9)}$) and compared with the corresponding performance index of the LDA and BPN
classifiers. The partial area index $A_z^{(0.9)}$ is defined as the area that lies under the ROC
curve but above a sensitivity threshold of 0.9 ($TPF_0 = 0.9$), normalized to the total area
above $TPF_0$, $(1 - TPF_0)$. The partial area index indicates the performance of the
classifier in the high-sensitivity (low false-negative) region,
which is most important for the clinical cancer detection task. In
addition, the performance of the LDA stage of the ART2LDA
classifier was evaluated by the estimation of the area under
the ROC curve, denoted as Az (LDA), for the case samples
passed onto the LDA classifier.
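
The definition of the partial area index can be checked numerically on a piecewise-linear ROC curve. The sketch below is an empirical illustration only and does not reproduce the binormal fit used in this study; the function names are hypothetical. The curve is given as points (FPF, TPF) that increase from (0, 0) to (1, 1).

```python
def roc_points(malignant, benign):
    """Empirical ROC operating points from two lists of scores."""
    points = [(0.0, 0.0)]
    for t in sorted(set(malignant) | set(benign), reverse=True):
        tpf = sum(s >= t for s in malignant) / len(malignant)
        fpf = sum(s >= t for s in benign) / len(benign)
        points.append((fpf, tpf))
    return points                      # the last point is (1.0, 1.0)

def partial_area_index(points, tpf0=0.9):
    """Area under the curve and above TPF = tpf0, divided by (1 - tpf0)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= tpf0 or x1 == x0:
            continue                   # segment below the threshold or vertical
        if y0 < tpf0:
            # Start the segment where it crosses the threshold line.
            x0 += (tpf0 - y0) * (x1 - x0) / (y1 - y0)
            y0 = tpf0
        area += (x1 - x0) * ((y0 - tpf0) + (y1 - tpf0)) / 2.0
    return area / (1.0 - tpf0)

perfect = partial_area_index([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
chance = partial_area_index([(0.0, 0.0), (1.0, 1.0)])
separable = partial_area_index(roc_points([0.9, 0.8], [0.1, 0.2]))
```

A perfect classifier gives an index of one, whereas the chance diagonal gives (1 - TPF_0)/2, which is 0.05 for a threshold of 0.9.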

V.  Results 

In  this  section  the  ART2LDA  classification  results  for 
malignant  and  benign  masses  will  be  presented  and  compared 
with  those  of  the  LDA  or  BPN  classifiers.  The  important 
point  in  this  study  is  the  fact  that  the  test  subset  is  truly 
independent  of  the  training  subset.  Only  the  training  subset 
is  used  for  feature  selection  and  classifier  training,  and  only 
the test subset is used for classifier validation. In order to
validate  the  prediction  abilities  of  the  classifier,  ten  different 
partitions  of  the  training  and  test  sets  were  used.  A  different 
ART2LDA  classifier  was  trained  using  each  training  set  and 
the  corresponding  set  of  selected  features.  The  classification 
result  was  estimated  as  the  average  performance  for  the  ten 
partitions. 

For  a  given  partition  of  training  and  test  sets,  feature 
selection  was  performed  based  on  the  training  set  alone.  The 
feature  selection  results  for  the  ten  different  training  groups  are 
shown  in  Table  I.  The  average  number  of  selected  features  was 
14. An average of two RLS features and twelve SGLD features
were selected for each of the training sets, representing
10% of all RLS features and 2.3% of all SGLD features,
respectively. Both types of features (RLS and SGLD) are
necessary  in  order  to  obtain  good  classification.  The  most  often 
selected  RLS  features  for  the  ten  training  sets  were:  horizontal 
short  run  emphasis  (four  times),  horizontal  long  run  emphasis 
(six  times),  vertical  run  length  nonuniformity  (three  times), 
horizontal  run  length  nonuniformity  (three  times).  The  most 
often  selected  SGLD  texture  measures  for  the  ten  training  sets 
were:  inverse  difference  moment  (eight  times),  information 
measure  of  correlations  one  and  two  (19  times),  difference 
average  (nine  times),  and  correlation  (ten  times).  For  a  given 
texture  measure,  features  at  different  angles  or  distances  may 
be selected, but these features are usually highly correlated so
that they can be considered to be similar and counted together
as described above.

Fig. 5. ART2LDA and LDA classification results for training and test sets
from data group three as a function of the generated number of classes.
Additionally the results for the LDA stage from the ART2LDA classifier
are plotted.

A.  ART2LDA  Classification  Results 

For  the  ART2LDA  classifier,  the  number  of  selected  features 
determines  the  dimensionality  of  the  input  vector  of  the  ART2 
classifier  and  the  dimensionality  of  the  LDA  classifier.  By 
applying different values for the vigilance parameter, ART2
classifiers with different numbers of classes were obtained. In
this study, the vigilance parameter pvig was varied from 0.9
to 0.99, resulting in a range of 10 to 240 classes. The overall
performance of the ART2LDA classifier was evaluated for
different numbers of ART2 classes because different subsets
of the samples were separated and classified by ART2 when
pvig was varied. In Fig. 5, the classification results for the
ART2LDA  are  compared  to  the  results  from  LDA  alone  for 
the  training  and  test  set  partition  three.  The  classification 
accuracy,  Az,  was  plotted  as  a  function  of  the  number  of 
ART2  classes.  For  this  training  and  test  set  partition,  when 
the  number  of  classes  was  between  20  and  60,  the  ART2LDA 
classifier  improved  the  classification  accuracy  for  the  test  set 
in  comparison  to  LDA.  As  the  number  of  classes  increased  to 
greater  than  60,  the  Az  value  increased  for  the  training  data 
set,  but  decreased  for  the  test  data  set  and  was  lower  than  that 
of  the  LDA  alone.  The  two  solid  lines  in  Fig.  5  show  the  Az 
values  for  the  LDA  stage  in  the  ART2LDA  classifier  for  both 
the  training  and  test  sets.  It  can  be  observed  that  the  test  Az 
for  the  LDA  stage  is  higher  than  the  Az  for  the  LDA  classifier 
alone,  but  not  as  high  as  Az  obtained  by  ART2LDA  when  the 
number  of  classes  is  small. 
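The link between the vigilance parameter and the number of generated classes can be illustrated with a deliberately simplified sketch. The code below is not ART2: it is a leader-style clustering that assigns a sample to the best-matching prototype only if their cosine similarity passes a vigilance test, and otherwise opens a new class. The function names and the use of the first sample of each class as a fixed prototype are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vigilance_clusters(samples, vigilance):
    """Assign each sample to a class; a stricter vigilance opens more classes."""
    prototypes = []  # first sample of each class, kept fixed
    labels = []
    for x in samples:
        sims = [cosine(x, p) for p in prototypes]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if best is not None and sims[best] >= vigilance:
            labels.append(best)          # match passes the vigilance test
        else:
            prototypes.append(x)         # no adequate match: open a new class
            labels.append(len(prototypes) - 1)
    return labels
```

On a small toy set, raising the vigilance from 0.5 to 0.999 increases the number of classes, which mirrors the qualitative behavior described above for pvig.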

Fig. 6. ART2LDA and LDA classification results for training and test sets
from data group one as a function of the generated number of classes.
Additionally the results for the LDA stage from the ART2LDA classifier
are plotted. [Legend: ART2LDA (tr), LDA stage (tr), ART2LDA (ts),
LDA stage (ts).]

In Fig. 6 the classification results of LDA and ART2LDA
for the partition one training and test sets are shown. In this
case it appeared that in the test set there were two large
malignant  outliers  which  degraded  the  LDA  performance. 
Only 15 classes at the ART2 stage in the ART2LDA were
enough to cluster the outliers into a separate malignant class
and to improve the performance of the LDA stage and the
overall result. The rest of the outliers required more ART2
classes before they were clustered into separate classes and
correctly classified as malignant. This is the reason for the
similar behavior of the classifiers for partitions three and one
in the range of 40 to 70 classes as seen in Figs. 5 and 6.
When the number of classes was less than 70, the test Az for
the LDA stage (Az(LDA)) was higher than the LDA alone, but
not as high as the Az for ART2LDA with fewer than 30 classes
(Fig. 6). The best Az values for the test data sets of the ten
training and test partitions are presented in Table II and Fig. 7.
The ART2LDA classifier achieved higher Az values than the
LDA alone in nine of the ten partitions. The average Az is
0.81 for ART2LDA and 0.78 for LDA alone. The standard
deviations of the Az values for the ten groups range from
0.03 to 0.05 for the ART2LDA classifier and from 0.04 to
0.05 for the LDA classifier.
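The summary figures quoted above can be checked directly against the per-partition test Az values listed in Table II. The short sketch below copies the LDA and ART2LDA columns of that table; the helper names `mean` and `count_wins` are illustrative.

```python
# Test Az values for data groups 1 to 10, copied from Table II.
LDA = [0.77, 0.78, 0.74, 0.77, 0.77, 0.80, 0.80, 0.77, 0.77, 0.86]
ART2LDA = [0.83, 0.80, 0.78, 0.77, 0.78, 0.83, 0.81, 0.80, 0.80, 0.89]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def count_wins(a, b):
    """Number of partitions in which classifier a is strictly better than b."""
    return sum(1 for x, y in zip(a, b) if x > y)
```

With these values the means are about 0.783 and 0.809, which round to the reported 0.78 and 0.81, and ART2LDA is strictly higher in nine partitions, with a tie in data group four.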

The performance of ART2LDA was also assessed by estimation
of the partial area under the ROC curve, Az(0.9), at a
TPF higher than 0.9. The results are presented in Table III
and Fig. 7. In the lower part of Fig. 7, the Az(0.9) values of the
test set for the corresponding ten partitions of training and test
sets are presented. The average test Az(0.9) value is 0.34 for the
ART2LDA and 0.27 for LDA. For nine of the ten partitions,
the Az(0.9) value was improved at the high-sensitivity operating
region (TPF > 0.9) of the ROC curve.

The  classifier  performance  was  also  evaluated  when  the 
ART2LDA  classifiers  were  designed  using  a  fixed  number 


IEEE  TRANSACTIONS  ON  MEDICAL  IMAGING.  VOL.  18,  NO.  12,  DECEMBER  1999 


TABLE II

Classifier Performance for the Ten Test Sets. The Az
Values Represent the Total Area Under the ROC Curve

Data Group No.   LDA    ART2LDA   BPN    ART2LDA(1)
 1               0.77   0.83      0.85   0.80
 2               0.78   0.80      0.82   0.77
 3               0.74   0.78      0.77   0.78
 4               0.77   0.77      0.75   0.77
 5               0.77   0.78      0.76   0.77
 6               0.80   0.83      0.82   0.81
 7               0.80   0.81      0.82   0.77
 8               0.77   0.80      0.74   0.75
 9               0.77   0.80      0.81   0.80
10               0.86   0.89      0.84   0.89
Mean             0.78   0.81      0.80   0.79

Fig. 7. Average Az classification results for the 10 test sets. The top graphs
represent the ART2LDA and LDA Az values for the total area under the
ROC curve. The bottom graphs represent the ART2LDA, ART2LDA(1) and
LDA Az values for the partial area of the ROC curve above the true positive
fraction of 0.9. [Horizontal axis: data group number. Legend: LDA (Az),
LDA (Az(0.9)), ART2LDA (Az), ART2LDA (Az(0.9)), ART2LDA(1) (Az(0.9)).]

TABLE III

Classifier Results for the Ten Test Sets. The Az(0.9)
Values Represent the Partial Area of the ROC Curve
Above the True Positive Fraction of 0.9

Data Group No.   LDA    ART2LDA   BPN    ART2LDA(1)
 1               0.14   0.23      0.31   0.26
 2               0.17   0.21      0.28   0.27
 3               0.19   0.32      0.27   0.32
 4               0.19   0.21      0.19   0.21
 5               0.24   0.26      0.32   0.24
 6               0.27   0.38      0.27   0.44
 7               0.32   0.31      0.38   0.30
 8               0.32   0.34      0.25   0.38
 9               0.40   0.49      0.40   0.49
10               0.44   0.60      0.38   0.60
Mean             0.27   0.34      0.31   0.35

of ART2 classes. The Az and Az(0.9) results, averaged over
the ten test partitions, are presented in Table IV. The average
Az with the ART2LDA classifier, compared to that of LDA
alone, was again improved between 15 and 40 classes. The
maximum average Az of 0.80 was achieved between 20 and
40 classes. The average Az(0.9) results are improved for all
ART2LDA classifiers presented in Table IV. The maximum
average Az(0.9) value is 0.33 and it remains constant between
30 and 40 classes.

TABLE IV

Average Az and Average Az(0.9) Classification Results for the Ten Test
Sets. Classifiers Were Designed Using a Fixed Number of ART2 Classes

                  LDA    ART2LDA
No. of classes    -      15     20     30     40     50     60
Az                0.78   0.80   0.80   0.80   0.80   0.78   0.77
Az(0.9)           0.27   0.30   0.31   0.33   0.33   0.31   0.31

An alternative way to evaluate the performance of a classifier
is its classification accuracy when a decision threshold for
malignancy is selected based on the training set. For instance,
a decision threshold may be selected such that all positive
samples from the training set are classified correctly, i.e., at a
sensitivity of 100%. The ART2LDA with this decision threshold
is referred to as ART2LDA(1). For a given training and
test partitioning, ART2LDA classifiers with different numbers
of classes in the ART2 stage were obtained (Figs. 5 and 6). For
each of these models the decision threshold for a sensitivity of
100% was selected from the training set and the corresponding
ART2LDA(1) classifier was obtained. Then the ART2LDA(1)
classifier (with a specific number of classes in the ART2 stage)
that correctly classified the maximum number of malignant
masses in the test set was selected. By using all samples of
the test set, the Az value was calculated for the corresponding
ART2LDA model. The Az and Az(0.9) values for the ART2LDA(1)
classifiers for the test sets of the ten data partitionings are shown in
Tables II and III. For five of the partitions the overall Az value
for ART2LDA(1) is higher than that of LDA alone (Table II).
The average Az value was 0.79. The partial areas above the
TP fraction of 0.9, Az(0.9), for the ten test data sets obtained
by the ART2LDA(1) classifier are also shown in Fig. 7. The
ART2LDA(1) achieved the highest average Az(0.9) value of
0.35 compared to ART2LDA and LDA (Table III).
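The threshold selection behind ART2LDA(1) can be sketched in a few lines. The sketch assumes that a higher classifier score means a higher likelihood of malignancy and that labels are coded 1 for malignant and 0 for benign; the function names are hypothetical and the scores in the usage note are made up.

```python
def threshold_for_full_sensitivity(train_scores, train_labels):
    """Largest threshold at which every malignant training sample
    (label 1) is still called malignant (score >= threshold)."""
    return min(s for s, y in zip(train_scores, train_labels) if y == 1)


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold means malignant."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / n_pos, tn / n_neg
```

For example, with training scores [0.2, 0.4, 0.6, 0.9] and labels [0, 1, 0, 1] the threshold is 0.4, which gives 100% sensitivity on the training set by construction. The same threshold applied to an independent test set carries no such guarantee, which is why the test sensitivity has to be measured separately.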

B.  BPN  Classification  Results 

A  multilayer  perceptron  back-propagation  neural  network 
with  a  single  hidden  layer  and  a  single  output  node  was  used 
for  comparison  with  the  ART2LDA  classifier.  The  number 
of  selected  features  determined  the  number  of  input  nodes  to 
the  BPN.  The  same  ten  training/test  set  partitions  (as  in  the 
case  of  ART2LDA)  were  used  for  the  training  and  validation 
of  the  BPN  classifiers.  BPN’s  with  their  number  of  hidden 
nodes  ranging  from  two  to  ten  were  evaluated  to  obtain  the 
best  architecture.  Back-propagation  training  was  used.  Each 
of  the  BPN’s  was  trained  for  up  to  18000  training  epochs. 
At  every  1000  epochs  the  neural  network  weights  were  saved 
and  the  classification  result  for  the  corresponding  test  set  was 
evaluated.  This  design  procedure  was  repeated  for  each  of  the 
ten  training/test  groups.  For  each  group,  the  best  test  result 
among  all  the  BPN  architectures  (different  number  of  hidden 
nodes)  and  all  the  training  epochs  examined  was  selected. 
The  average  test  Az  over  the  ten  groups  for  the  BPN  was 
0.80,  compared  to  0.81  for  ART2LDA  (Table  II).  The  standard 
deviations  of  the  Az  values  for  the  ten  groups  range  from  0.04 
to 0.05 for the BPN. The average partial Az(0.9) for the BPN
was 0.31, compared to 0.34 for ART2LDA (Table III). The
Az and Az(0.9) of the ART2LDA classifier were higher than
those of the BPN in six of the ten training/test groups.

VI.  Discussion 

In  the  present  study,  a  new  classifier  (ART2LDA)  was 
designed  and  applied  to  the  classification  of  malignant  and 
benign  masses.  The  results  indicated  that  the  ART2LDA 
classifier had better generalizability than an LDA classifier
alone.  The  ART2  classifier  grouped  the  case  samples  that  were 
different  from  the  main  population  into  separate  classes.  The 
minimum  number  of  classes  needed  to  start  the  clustering  of 
outliers  into  separate  classes  depended  on  how  different  the 
outliers  were  from  the  rest  of  the  sample  population.  For  the 
ten  different  partitions  of  training  and  test  sets  used  in  this 
study,  the  minimum  number  varied  between  13  and  15  classes. 
When  the  number  of  ART2  classes  was  less  than  this  minimum 
number  of  classes,  the  ART2  classifier  generated  only  mixed 
malignant-benign  classes  and  all  samples  were  transferred  to 
the  LDA  stage.  In  that  case,  the  ART2LDA  was  equivalent 
to  the  LDA  classifier  alone.  When  a  higher  number  of  classes 
were  generated,  an  increased  number  of  cases  that  might  be 
considered  outliers  of  the  general  data  population  was  removed 
(clustered  in  separate  classes).  For  the  ten  training  sets  used 
in  this  study,  the  malignant  outliers  were  gradually  removed 
when  the  number  of  classes  increased.  The  training  accuracy 
increased  when  the  number  of  classes  increased  and  Az  could 
reach  the  value  of  1.0.  However,  a  large  number  of  ART2 
classes  led  to  overfitting  the  training  sample  set  and  poor 
generalization  in  the  test  set.  The  classification  accuracy  of 
ART2  for  the  test  set  tended  to  decrease  when  the  number  of 
classes  was  greater  than  about  70.  The  large  number  of  classes 
also  led  to  a  reduction  in  the  generalizability  of  the  second- 
stage  LDA;  the  training  of  LDA  with  a  small  number  of 
samples  would  again  result  in  overfitting  the  training  set,  and 
poor  generalizability  in  the  test  set.  This  effect  was  observed 
when  more  than  60  or  70  classes  were  generated  by  ART2 
(see  Figs.  5  and  6). 

The  classification  accuracy  of  ART2LDA  increased  initially 
with  an  increased  number  of  classes  and  then  decreased 
after  reaching  a  maximum.  The  correct  classification  of  the 
outliers  by  the  ART2  in  combination  with  an  improvement 
in  the  classification  by  the  LDA  resulted  in  the  increased 
accuracy.  When  the  number  of  ART2  classes  was  further 
increased,  the  effects  of  overfitting  by  the  ART2  and  the  LDA 
became  dominant  and  the  prediction  ability  of  the  ART2LDA 
decreased.  In  some  cases  the  second-stage  LDA  prediction 
was  much  worse  than  the  ART2.  In  other  cases  the  ART2 
could  not  generalize  well.  The  generation  of  a  high  number  of 
classes  is  therefore  impractical  and  unnecessary  both  from  a 
computational  and  a  methodological  point  of  view. 

For  the  optimal  number  of  classes  (usually  less  than  50  for 
the  data  sets  used)  the  Az  value  for  the  second-stage  LDA  in 
the  ART2LDA  was  better  than  an  LDA  classifier  alone,  but  it 
was  not  as  good  as  the  overall  Az  from  the  ART2LDA.  It  is 
evident  that  the  ART2  was  a  useful  classifier  for  improvement 
of the second-stage classification.

When the partial area of the ROC curve above the true positive
fraction (TPF) of 0.9 (Az(0.9)) was considered as a measure
of classification accuracy, the advantage of ART2LDA over
LDA alone became even more evident. By removing and correctly
classifying the outliers, the accuracy of the classification
was increased at the high-sensitivity end of the curve.

The classifier performance was evaluated when the
ART2LDA classifiers were designed using a fixed number
of ART2 classes. The results showed improved performance
of the ART2LDA in a range between 20 and 40 ART2
classes. Both the average Az and the average Az(0.9) reached
a maximum within this region, and the maximum average Az
and the average Az(0.9) values remained unchanged between 30
and 40 classes. These results indicated that the performance
of a hybrid ART2LDA classifier was robust and stable and
could be potentially useful in real clinical applications.

We have performed statistical tests with the CLABROC
program to estimate the significance of the differences between
the Az values from the ART2LDA, the LDA alone, and the
BPN, as well as of the differences in the partial Az(0.9) from the
three classifiers. The statistical tests were performed for each
individual data set partition because the correlation among the
data sets from the different partitions precludes the use of
Student's paired t test with the ten partitions. We found that the
differences in both cases did not reach statistical significance
because of the small number of test samples and thus the large
standard deviation in the Az values. However, the consistent
improvements in Az and Az(0.9) by the ART2LDA (nine out of
ten data set partitions in both cases for LDA and six out of
ten data set partitions in both cases for BPN) suggest that the
improvement  was  not  by  chance  alone,  and  that  the  accuracy 
of  a  classification  task  could  be  improved  by  the  use  of  an 
ART2  network.  In  addition,  one  advantage  of  the  ART2LDA 
is  that  the  training  process  is  more  efficient  than  that  of  the 
BPN,  especially  when  there  is  a  subset  of  outlying  samples.  In 
such  a  case,  the  BPN  will  require  a  large  number  of  training 
epochs  to  minimize  the  error  function. 

ART2LDA  can  be  trained  to  classify  the  sample  cases  into 
more  than  two  classes,  such  as  a  class  of  normal  tissue  regions 
in  addition  to  malignant  and  benign  masses.  There  will  be  an 
increase  in  the  complexity  of  training  and  a  larger  training 
sample  size  will  be  desired,  but  these  requirements  will  be 
comparable  for  the  different  classifiers.  In  a  clinical  situation, 
if  the  classification  task  is  performed  on  all  computer-detected 
lesions,  the  classifier  has  to  distinguish  the  falsely  detected 
normal tissue from malignant or benign lesions. However,
it may be noted that a classifier that can distinguish only
malignant and benign masses is applicable to the scenario
in which the radiologist identifies a suspicious lesion on the
mammogram and would like to have a second opinion about its
likelihood of malignancy before making a diagnostic decision.
Therefore, the development of a classifier that can differentiate
malignant and benign masses is the research of interest for
many investigators.

Similarly, ART2 can be trained to discover and remove a
pure benign mass class. The approach will be similar to the
task of classifying and removing the pure malignant classes
as described in this study. However, our approach of removing
the malignant classes will reduce the chance of misclassification
of malignant masses. In breast cancer detection, the cost
of a false negative (missed cancer) is very high. Therefore, our
goal in classifier design is to be conservative. By removing
the malignant classes in the first stage, any misclassification
to these classes will be regarded as malignant. The remaining
classes will be classified again with the second-stage classifier
so malignant masses will be less likely to be missed.
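The two-stage routing described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: cluster assignments are taken as given (for example from an ART2-like first stage), labels are coded 1 for malignant and 0 for benign, and `second_stage` is any callable that stands in for the trained LDA.

```python
def pure_malignant_clusters(train_clusters, train_labels):
    """Return the ids of clusters whose training members are all malignant."""
    members = {}
    for cluster_id, label in zip(train_clusters, train_labels):
        members.setdefault(cluster_id, []).append(label)
    return {c for c, labels in members.items() if all(y == 1 for y in labels)}


def two_stage_predict(cluster_id, features, malignant_clusters, second_stage):
    """Call a sample malignant if it falls in a pure malignant cluster;
    otherwise defer to the second-stage classifier."""
    if cluster_id in malignant_clusters:
        return 1
    return second_stage(features)
```

A sample assigned to a pure malignant cluster is never passed to the second stage, so any error made by the first stage in these clusters falls on the side of calling a mass malignant, which matches the conservative design goal stated above.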

The  problem  of  classification  of  malignant  and  benign 
masses has been studied by many investigators. Rangayyan
et al. [15] used a Mahalanobis distance classifier (a modification
of an LDA classifier) and the leave-one-out method to evaluate
the classification of 54 masses. Fogel et al. [16] compared
LDA and BPN classifiers using the leave-one-out method and
139 masses (malignant and benign classification). Highnam
et al. [17] used a morphological feature called a halo to
classify 40 masses as malignant and benign. Huo et al. [22]
employed BPN and a rule-based classifier to classify 95 masses
using the leave-one-out evaluation method. Sahiner et al. [12]
used an LDA classifier and the leave-one-out method to
classify 168 masses. An important difference between the
classifier designed in this study and the previous studies in
the CAD field is the method of feature selection. In the
above-mentioned studies [12], [15]-[17], [22] and several other
published studies [18]-[21] the features were selected from the
entire  data  set  first,  and  then  the  data  set  was  partitioned  into 
training  and  test  sets.  This  meant  that  at  the  feature  selection 
stage  of  the  classifier  design,  the  entire  data  set  was  used  as  a 
training  set.  Depending  on  the  distribution  of  the  features  and 
the  total  number  of  samples  used,  the  test  results  in  these 
studies  might  be  optimistically  biased  [37].  In  our  current 
study,  the  entire  data  set  was  initially  partitioned  into  training 
and  test  sets  and  then  feature  selection  was  performed  only 
on  the  training  set.  This  method  will  result  in  a  pessimistic 
estimate  of  the  classifier  performance  when  the  training  set  is 
small  [37].  However,  it  will  provide  a  more  conservative  but 
realistic  estimation  of  the  classifier  performance  in  the  general 
patient  population.  We  can  expect  that  the  performance  would 
be  improved  if  the  classifier  in  this  study  were  designed  using 
a  large  data  set.  Since  our  main  purpose  in  this  study  was 
to  compare  the  ART2LDA  classifier  with  the  commonly  used 
LDA  and  BPN,  we  did  not  attempt  to  quantify  how  pessimistic 
our  results  were  in  this  study. 
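The order of operations described above (partition first, then select features on the training set only) can be sketched as follows. The ranking criterion here, the absolute difference of class means, is a simple stand-in chosen for illustration and is not the stepwise selection with Fin and Fout thresholds used in this study; the function names are hypothetical.

```python
import random


def split_indices(n_samples, train_fraction, seed):
    """Randomly partition sample indices into disjoint training and test lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def select_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| using only `rows`.

    Assumes both classes are present among `rows`.
    """
    def separation(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    n_features = len(X[0])
    return sorted(range(n_features), key=separation, reverse=True)[:k]
```

Passing only the training indices as `rows` keeps the test samples out of feature selection entirely. Passing all indices instead would reproduce the select-then-partition order that the text identifies as a source of optimistic bias.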

The most important contribution of this paper is to introduce
a new approach that utilizes a two-stage
unsupervised-supervised hybrid classifier. We believe that the hybrid
approach will improve classification when the sample distribution
contains subpopulations that may be difficult for a single
classifier to classify. It will be useful for similar classification
tasks although different classifiers may be used in each stage
of the hybrid structure.

VII.  Conclusion 

A new classifier combining an unsupervised ART2 and
a supervised LDA has been designed and applied to the
classification of malignant and benign masses. A data set
consisting of 348 films (179 malignant and 169 benign)
was randomly partitioned into training and test subsets. Ten
different random partitions were generated. For each training
set, texture features were extracted and feature selection was
performed. An average of 14 features were selected for each
group. A hybrid ART2LDA classifier, an LDA, and a BPN
were trained by using each of the ten training sets. The Az
value under the ROC curve for the test sets, averaged over
the ten partitions, was higher for ART2LDA (Az = 0.81)
compared to those of the LDA alone (Az = 0.78) and of the
BPN (Az = 0.80). A greater improvement was obtained when
the partial ROC area above a true-positive fraction of 0.9 was
considered. The average partial Az for ART2LDA was 0.34,
as compared to 0.27 for LDA and 0.31 for BPN. Additionally,
for the ART2LDA classifiers that correctly classified the
maximum number of malignant masses in the test sets with
the decision threshold defined with the training set, the average
partial Az was 0.35. These results indicate that the hybrid
classifier is a promising approach for improving the accuracy
of classifiers for CAD applications.

Classifier design is one of the key steps in the development of computer-aided diagnosis (CAD)
algorithms. A classifier is designed with case samples drawn from the patient population. Generally,
the sample size available for classifier design is limited, which introduces variance and bias into the
performance of the trained classifier, relative to that obtained with an infinite sample size. For CAD
applications, a commonly used performance index for a classifier is the area, Az, under the receiver
operating characteristic (ROC) curve. We have conducted a computer simulation study to investigate
the dependence of the mean performance, in terms of Az, on design sample size for a linear
discriminant and two nonlinear classifiers, the quadratic discriminant and the backpropagation
neural network (ANN). The performances of the classifiers were compared for four types of class
distributions that have specific properties: multivariate normal distributions with equal covariance
matrices and unequal means, unequal covariance matrices and unequal means, and unequal covariance
matrices and equal means, and a feature space where the two classes were uniformly distributed
in disjoint checkerboard regions. We evaluated the performances of the classifiers in
feature spaces of dimensionality ranging from 3 to 15, and design sample sizes from 20 to 800 per
class. The dependence of the resubstitution and hold-out performance on design (training) sample
size (Nt) was investigated. For multivariate normal class distributions with equal covariance matrices,
the linear discriminant is the optimal classifier. It was found that its Az-versus-1/Nt curves
can be closely approximated by linear dependences over the range of sample sizes studied. In the
feature spaces with unequal covariance matrices where the quadratic discriminant is optimal, the
linear discriminant is inferior to the quadratic discriminant or the ANN when the design sample size
is large. However, when the design sample is small, a relatively simple classifier, such as the linear
discriminant or an ANN with very few hidden nodes, may be preferred because performance bias
increases with the complexity of the classifier. In the regime where the classifier performance is
dominated by the 1/Nt term, the performance in the limit of infinite sample size can be estimated as
the intercept (1/Nt = 0) of a linear regression of Az versus 1/Nt. The understanding of the performance
of the classifiers under the constraint of a finite design sample size is expected to facilitate
the selection of a proper classifier for a given classification task and the design of an efficient
resampling scheme. © 1999 American Association of Physicists in Medicine.

I. INTRODUCTION

With the advent of digital imaging modalities, computer-aided
diagnosis (CAD) is becoming an important area of
research in medical imaging. A CAD algorithm can detect
abnormalities and classify disease or normal cases based on
image and/or patient information, and thus provide a second
opinion to the radiologist in the detection or diagnostic
decision-making process.

Design of classifiers that can accurately distinguish normal
and abnormal features is a critical step in the development
of CAD algorithms. It has been shown that the performance
of a classifier for unknown cases depends on the
sample size used for training [1]. When a finite design (training)
sample size is used, the performance is pessimistically
biased in comparison to that obtained from an infinitely large
design sample. In order to design a classifier with a performance
generalizable to the population at large, one has to use
a sufficient number of case samples that are representative of
the population. However, the availability of case samples is
often limited in medical imaging research. It is therefore important
to study the sample-size dependence of different classifiers
and determine the most efficient way of training a
classifier, under the constraint of a finite sample size.

We note that the concept of generalizability may be used
in several technical senses when assessing the performance
of a classifier: one with respect to mean classifier performance,
the other with respect to the variance of classifier
performance. In many classifier design problems, one is most
interested in investigating if the mean performance of a classifier
estimated from a given set of finite design samples can
be generalized to classification performance with unknown
test samples drawn from the same population of cases. The
generalizability in this regard can be observed from the biases
of the mean performances in the finite design set and in
the test set in comparison to the optimal performance estimated
from an infinite design set. The bias in the mean performance
of different classifiers under various input conditions
is the subject of investigation in this study. We will
further discuss other interpretations of generalizability in the
Discussion section of this paper.

A number of investigators have studied the finite-sample-size problem.¹⁻⁹ Fukunaga¹,³ derived a general formulation for the bias and variance of a function, f, which is to be estimated from the available samples. When f is a nonlinear function of the mean vectors and covariance matrices of two feature distributions, it has been shown that a bias results from the nonlinear propagation of the finite-sample variances in the estimates of the mean vectors and covariance matrices through this function. For multivariate normal data, these variances are proportional to 1/Nt, where Nt is the design sample size, and this dependence propagates into the lowest-order terms in the bias. The bias is independent of the test sample size, Ntest. All measures of classifier performance that count the fraction of times the decision value for an abnormal case exceeds that for a normal case (independent of the underlying distribution), and various measures of error for normally distributed decision functions, are nonlinear functions of the parameters of the underlying distributions. They are thus subject to this effect. Fukunaga and Hayes³ analyzed the finite-sample effects on the probability of misclassification (PMC) of a classifier and suggested a technique that makes use of the linear dependence of PMC on 1/Nt to estimate the performance at Nt → ∞ with a finite sample set.
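A pairwise-counting performance index of the kind mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming NumPy: it computes the nonparametric (Mann-Whitney) estimate of the area under the ROC curve, which is not necessarily how Az was estimated in this study. The function name `auc_pairwise` and the example scores are hypothetical.

```python
import numpy as np

def auc_pairwise(scores_normal, scores_abnormal):
    """Fraction of (abnormal, normal) pairs in which the abnormal case
    has the higher decision value; ties count as one half."""
    a = np.asarray(scores_abnormal, dtype=float)[:, None]
    n = np.asarray(scores_normal, dtype=float)[None, :]
    return float(np.mean((a > n) + 0.5 * (a == n)))

# Made-up decision values, for illustration only.
auc = auc_pairwise([0.1, 0.4, 0.35], [0.8, 0.4, 0.9])
print(auc)
```

Because the index depends on the estimated classifier parameters only through comparisons of decision values, it is a nonlinear function of those parameters, which is why the finite-sample bias discussed above applies to it.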

For the evaluation of medical diagnostic systems, the most commonly used performance index is the area under the receiver operating characteristic (ROC) curve, Az. We have derived analytically that, for linear discriminant classifiers, the classifier performance in terms of Az can be approximated by a linear function of 1/Nt under conditions in which higher-order terms in 1/Nt can be neglected. We have been investigating the dependence of Az on sample size by simulation studies.⁷⁻⁹ Wagner et al.¹⁰,¹¹ have also analyzed the effects of design and test sample sizes on the variance components of the classifier performance. Although these behaviors depend strongly on the class distributions and the properties of the classifier, the studies will provide some insight into the sample size requirements for the design of different classifiers. This work may eventually lead to the selection of an efficient resampling scheme for classifier design, as well as the development of a statistical test of the sample size requirements and the generalizability of the trained classifier.

Fig. 1. The sampling and evaluation scheme of the simulation study.

In this paper, we will describe the simulation studies and analyze the effects of sample size on classifier performance. Several commonly used classifiers, including the linear discriminant, the quadratic discriminant, and the back-propagation neural network, will be studied and compared under different input conditions. Feature distributions with markedly different characteristics will be used to represent a variety of situations that may be encountered in classification problems for many detection or diagnostic tasks.

II.  MATERIALS  AND  METHODS 

We performed simulation studies to evaluate the effects of sample size on classifier design. Normal and abnormal case samples were randomly drawn from known probability distributions of the two classes. These samples were then used to design classifiers for differentiation of normal and abnormal cases. The simulation approach ensures that any number of case samples can be obtained from populations with known statistical properties. It thus allows evaluation of the dependence of classifier performance on design sample size and comparison of the performance with the theoretically predicted optimal classification based on the chosen probability distributions.

A.  Simulation  study 

The sampling and evaluation scheme of the simulation study is shown in Fig. 1. In this study, we considered only the situation in which equal numbers (= Ntotal/2) of normal and abnormal cases randomly drawn from the class distributions were available in our data set. A resampling strategy similar to the technique suggested by Fukunaga and Hayes was devised to generate the Az-vs-1/Nt curve. Subsets of $N_{t_1}, N_{t_2}, \ldots, N_{t_j}$ design samples were randomly drawn from the available sample set, again under the constraint that the numbers of normal and abnormal samples were equal in each subset, i.e., $N_{t_i,\mathrm{normal}} = N_{t_i,\mathrm{abnormal}} = N_{t_i}/2$ $(i = 1, \ldots, j)$. A classifier was designed by using each subset of samples. The random sampling of a given subset from the available set of Ntotal samples was performed without replacement, whereas the random sampling of different subsets always started from the same set of Ntotal samples. Therefore, after drawing a given design subset of $N_{t_i}$ samples, the remaining $N_{\mathrm{total}} - N_{t_i}$ samples were independent of the design samples and were used as the test samples. For simplicity, the number of design samples per class is denoted as N in the following discussion.
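The partitioning step described above can be sketched as follows. This is a minimal Python illustration, assuming NumPy and integer class labels (0 = normal, 1 = abnormal); the function name `balanced_split` and the chosen subset size are hypothetical, not taken from the study.

```python
import numpy as np

def balanced_split(labels, n_design, rng):
    """Draw n_design design samples (half per class) without replacement.

    Returns (design_idx, test_idx); the test set is everything not drawn,
    so it is independent of the design set.
    """
    labels = np.asarray(labels)
    per_class = n_design // 2
    drawn = []
    for c in (0, 1):
        idx = np.flatnonzero(labels == c)
        drawn.append(rng.choice(idx, size=per_class, replace=False))
    design_idx = np.concatenate(drawn)
    mask = np.ones(labels.size, dtype=bool)
    mask[design_idx] = False
    return design_idx, np.flatnonzero(mask)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 1000)          # Ntotal = 2000, equal class sizes
design_idx, test_idx = balanced_split(labels, 200, rng)
print(len(design_idx), len(test_idx))
```

Each call starts again from the full set of Ntotal samples, so different design subsets may overlap with one another while each subset remains disjoint from its own test set.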

In general, there are two methods, resubstitution and hold-out, for testing classifier performance. In the resubstitution method, the design sample set is resubstituted into the trained classifier to test its performance, whereas in the hold-out method, an independent test set is used. It has been shown¹ that, for a Bayes classifier trained with a finite number of design samples, the resubstitution estimate of the classifier performance is optimistically biased whereas the hold-out estimate is pessimistically biased in comparison to the performance achievable with an infinite design sample set. The mean performance obtained from the former estimate provides an upper bound, and that from the latter provides a lower bound, on the true classifier performance. When the design sample size is limited, it is important to evaluate the hold-out performance to avoid an overly optimistic prediction of the classifier performance. In the limit of very large sample size, the upper and lower bounds converge towards the unbiased estimate.

In this study, we evaluated the performance of the classifier using both the resubstitution and the hold-out methods as a function of the finite design sample size Nt. To reduce the variances in the estimates of Az, we randomly resampled each design subset without replacement from the same Ntotal samples Np times, trained and tested the classifier each time, and estimated the average Az from the Np individual Az values, as shown in Fig. 1. The resubstitution or hold-out Az-vs-1/Nt curve was plotted from the j points, and the unbiased estimate of Az in the limit of Nt → ∞ could be extrapolated from either curve.
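The extrapolation step can be illustrated with a straight-line fit of the mean Az against 1/Nt, whose intercept is the estimate at Nt → ∞. The sketch below assumes NumPy and assumes that all points lie in the linear region of the curve; the function name `extrapolate_az` and the data points are synthetic and purely illustrative.

```python
import numpy as np

def extrapolate_az(n_design, az_mean):
    """Least-squares line through (1/Nt, Az).

    Returns (intercept, slope); the intercept estimates Az at Nt -> infinity.
    """
    x = 1.0 / np.asarray(n_design, dtype=float)
    slope, intercept = np.polyfit(x, np.asarray(az_mean, dtype=float), 1)
    return float(intercept), float(slope)

# Synthetic points lying exactly on Az = 0.89 - 2.0 / Nt (not measured data).
n_design = np.array([40, 80, 160, 400])
az_holdout = 0.89 - 2.0 / n_design
az_inf, slope = extrapolate_az(n_design, az_holdout)
print(round(az_inf, 4), round(slope, 4))
```

A hold-out curve gives a negative slope (performance rises toward the intercept as Nt grows), whereas a resubstitution curve gives a positive one.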

This method of estimating classifier performance at large Nt by generating a few data points at finite sample sizes is similar to the Fukunaga and Hayes technique. However, we did not assume that the j points were in the linear region of the Az-vs-1/Nt curve, and we used resampling to reduce the variances. In fact, one of the goals of this study was to investigate the range of design sample size in which the performance curve was approximately linear for various classifiers and probability distributions of the class populations. Therefore, we used a much larger total number of samples (Ntotal = 2000) in our simulation study than is generally available for classifier design. We could then choose Nt over a wide range and study the behavior of the entire Az-vs-1/Nt curve.

To estimate the population mean of Az at each Nt, we repeated the above experiment Ne times, each with 2000 independently drawn samples from the population. The population mean of Az was estimated by averaging the Az values obtained from the Ne experiments. We did not analyze the variances in this study because of the complication of the correlation among the Np values of Az introduced by resampling. A detailed analysis of the variances and their modeling was performed in a separate study by Wagner et al.,¹⁰,¹¹ in which a different study design was used.
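The nested averaging over Ne experiments and Np partitions can be sketched end to end for a linear discriminant. This is a simplified illustration under several assumptions that are not the study's exact procedure: NumPy is available, Az is replaced by the nonparametric pairwise AUC, the classes are unit-covariance normals with a squared Mahalanobis distance of 3, and the loop counts are kept small for speed. All function names are hypothetical.

```python
import numpy as np

def auc(neg, pos):
    """Nonparametric AUC: fraction of pairs with pos score > neg score."""
    p, n = pos[:, None], neg[None, :]
    return float(np.mean((p > n) + 0.5 * (p == n)))

def fit_lda(x1, x2):
    """Weight vector of a linear discriminant from sample estimates."""
    cov = 0.5 * (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False))
    return np.linalg.solve(cov, x2.mean(axis=0) - x1.mean(axis=0))

def mean_az(k=9, n_per_class=20, n_total=2000, n_exp=5, n_part=5, seed=0):
    rng = np.random.default_rng(seed)
    half = n_total // 2
    m = np.sqrt(3.0 / k)                      # gives k * m**2 = 3
    az_tr, az_ts = [], []
    for _ in range(n_exp):                    # independent data sets
        normal = rng.standard_normal((half, k))
        abnormal = rng.standard_normal((half, k)) + m
        for _ in range(n_part):               # repartition the same data set
            p1, p2 = rng.permutation(half), rng.permutation(half)
            d1, t1 = normal[p1[:n_per_class]], normal[p1[n_per_class:]]
            d2, t2 = abnormal[p2[:n_per_class]], abnormal[p2[n_per_class:]]
            w = fit_lda(d1, d2)
            az_tr.append(auc(d1 @ w, d2 @ w))  # resubstitution
            az_ts.append(auc(t1 @ w, t2 @ w))  # hold-out
    return float(np.mean(az_tr)), float(np.mean(az_ts))

az_tr, az_ts = mean_az()
print(round(az_tr, 3), round(az_ts, 3))
```

With a small design set, the resubstitution mean should lie above, and the hold-out mean below, the large-sample value of about 0.89 for this separation.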


By varying the number of design samples per class, N, over a large range from 20 to 800, the regime where the 1/Nt dependence dominated could be observed from the Az (population mean)-vs-1/Nt (or 1/N) curves. It is important to note that, although the number of test samples, Ntest = 2000 − Nt, varied from point to point on both the resubstitution and the hold-out curves, the bias in Az is independent of Ntest.¹ The shape of the Az-vs-1/N curve is independent of Ntest after Nt is fixed. However, the variance of a given Az does depend on the test sample size.

For simplicity, we will refer to these estimates of Az (population mean) as Az(tr) for the resubstitution and as Az(ts) for the hold-out performance in the following discussions.


B.  Class  distributions 
1. Multivariate normal distributions

For three of the four types of class distributions, we assumed that the normal and abnormal classes followed multivariate normal distributions in the feature space. The dimensionality of the feature space, k, was varied from 3 to 15. A multivariate normal distribution is completely specified by the mean vector of the rth class, denoted as μr (r = 1, 2), and its covariance matrix, denoted as Σr. The separation of the normal and abnormal classes is measured by the Bhattacharyya distance, B, defined as¹,¹²

$$B = \frac{1}{8}\,\Delta + \frac{1}{2}\,\ln\frac{\det[(\Sigma_1+\Sigma_2)/2]}{\sqrt{\det\Sigma_1}\,\sqrt{\det\Sigma_2}}, \tag{1}$$

where det Σr denotes the determinant of Σr, and Δ is the squared Mahalanobis distance,¹² defined as

$$\Delta = (\mu_2-\mu_1)^T\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_2-\mu_1). \tag{2}$$


The Mahalanobis distance is the Euclidean distance between the means of the two distributions, normalized by the square root of the average of their covariance matrices. It can therefore be considered to be a measure of the signal-to-noise ratio (SNR) between the abnormal and the normal distributions. The second term of B is the contribution from the difference in the covariance matrices of the two class distributions. If the covariance matrices are equal, the second term will be zero and the Bhattacharyya distance will be equal to 1/8 of the squared Mahalanobis distance.

In the current study, three types of multivariate normal class distributions were considered. In the following discussion, we shall refer to the use of simultaneous diagonalization for the two covariance matrices of the class distributions. This operation leaves the normal-based decision functions unchanged because the distance measures that arise in these decision functions are invariant to any nonsingular linear transformation.¹

Fig. 2. A schematic illustration of the two class distributions with equal covariance matrices and unequal means in a 2D feature space. The circles represent contours of equal probability in each distribution.

Fig. 3. A schematic illustration of the two class distributions with unequal covariance matrices and unequal means in a 2D feature space. The closed curves represent contours of equal probability in each distribution.

(1) Equal covariance matrices and unequal means: In this case, the covariance matrices of the normal and abnormal class distributions can be simultaneously diagonalized and the variances of the individual feature components can be scaled to unity. Therefore, without loss of generality, the covariance matrices of the two classes could be assumed to be equal to identity matrices, Σ1 = Σ2 = I. The mean feature vector for the first class was assumed to be zero, μ1 = 0, and the mean feature vector for the second class was μ2 = M, with all components of M equal to a constant m. The magnitude of m could be adjusted to obtain a desired separation of the two classes. For the purpose of this simulation study, we chose m such that the squared Mahalanobis distance was 3, i.e., the Bhattacharyya distance was 3/8, for feature spaces of any dimensionality. As discussed below, this separation corresponds to a theoretical Az of 0.89, which is in the performance range of many classification problems in CAD applications. An example of the two class distributions in a 2D feature space is shown schematically in Fig. 2.
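As a numerical check of this choice: with identity covariance matrices and all k components of the second mean equal to m, the squared Mahalanobis distance is k·m², so m = sqrt(3/k) gives a squared distance of 3 in any dimensionality. The sketch below, assuming NumPy, evaluates the Bhattacharyya distance as one eighth of the squared Mahalanobis distance plus half the log ratio of the determinant of the average covariance to the geometric mean of the two determinants. The function name `bhattacharyya` is hypothetical.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Return (B, delta): Bhattacharyya and squared Mahalanobis distances."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    delta = float(diff @ np.linalg.solve(cov_avg, diff))
    # Log-determinants are used instead of determinants for numerical stability.
    _, ld_avg = np.linalg.slogdet(cov_avg)
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (ld_avg - 0.5 * (ld1 + ld2))
    return delta / 8.0 + float(cov_term), delta

for k in (3, 9, 15):
    m = np.sqrt(3.0 / k)                      # component of the second mean
    B, delta = bhattacharyya(np.zeros(k), np.eye(k), np.full(k, m), np.eye(k))
    print(k, round(m, 4), round(delta, 6), round(B, 6))
```

The covariance term vanishes here because the two matrices are equal, leaving B = Δ/8 = 3/8.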

(2) Unequal covariance matrices and unequal means: The covariance matrix of the first class was again diagonalized and scaled to be an identity matrix, Σ1 = I, and the mean feature vector for the first class was assumed to be zero, μ1 = 0. The covariance matrix of the second class, Σ2, was simultaneously diagonalized to have eigenvalues λi, i = 1, ..., k. For this study, we generated the values of λi with the simple relationship:


$$\lambda_i = \lambda_{\min} + \frac{(i-1)(\lambda_{\max}-\lambda_{\min})}{k-1}, \tag{3}$$


and evaluated one condition where λmin = 1 and λmax = 2 for all dimensionalities of the feature spaces. We also assumed that the components of the mean feature vector μ2 were equal, the values of which were adjusted to achieve a Bhattacharyya distance of 3/8. For the purpose of demonstrating the general trends of the Az-vs-1/N curves and comparing the relative performance of the different classifiers under the various conditions, the specific choices of these values are not critical. Figure 3 illustrates an example of the two class distributions in a 2D feature space.

(3) Unequal covariance matrices and equal means: The covariance matrix of the first class was the same as that in the first two cases described above. The covariance matrix of the second class was proportional to the identity matrix, Σ2 = αI, where the proportionality constant α was adjusted to provide a Bhattacharyya distance of 3/8. The mean feature vectors of the two classes were equal, μ1 = μ2 = 0. In this case, the discriminatory power between the two classes comes entirely from the difference in the covariance matrices. A schematic of the two class distributions in a 2D feature space is shown in Fig. 4.
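For this case the mean term of the Bhattacharyya distance vanishes, and with an identity matrix for the first class and a multiple of the identity for the second, the covariance term reduces to (k/2)·ln[(1 + c)/(2·sqrt(c))], where c is the proportionality constant. The constant can then be found numerically. The following is a minimal Python sketch using bisection; the function names are hypothetical, and the choice of the root with c > 1 is an assumption (the reciprocal 1/c gives the same distance).

```python
import math

def bhattacharyya_equal_means(c, k):
    """B for cov1 = I, cov2 = c * I, equal means, in k dimensions."""
    return 0.5 * k * math.log((1.0 + c) / (2.0 * math.sqrt(c)))

def solve_scale(k, target=3.0 / 8.0, hi=1e6, tol=1e-12):
    """Bisection for the scale c > 1 that yields the target distance."""
    lo = 1.0                      # B = 0 at c = 1 and increases for c > 1
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if bhattacharyya_equal_means(mid, k) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for k in (3, 9, 15):
    c = solve_scale(k)
    print(k, round(c, 4), round(bhattacharyya_equal_means(c, k), 6))
```

The required scale moves closer to 1 as k increases, since each added dimension contributes the same amount to the distance.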

2.  Checkerboard  distributions 

The fourth type of class distribution was a checkerboard in which the normal and abnormal classes were located in alternate square box regions of the feature space. Within each box of the checkerboard, the feature vectors were uniformly distributed. The two classes did not overlap with each other, so they could be perfectly separated by an "ideal" classifier with Az = 1. We considered a 2×3 checkerboard in a 2D feature space and a 2×2×2 checkerboard in a 3D feature space. The example of a 2×3 checkerboard in a 2D feature space is shown in Fig. 5. Such class distributions may not be common in actual classification problems encountered in CAD. However, they were included in this study to demonstrate the capability and limitations of the different classifiers when the class distributions were not multivariate normal.
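Sampling from such a checkerboard can be sketched as follows. This illustration assumes NumPy, unit-sized boxes, class assignment by the parity of the box indices, and equal probability for each box of a class; none of these details is specified in the text, and the function name `sample_checkerboard` is hypothetical.

```python
import numpy as np

def sample_checkerboard(shape, n_per_class, rng):
    """Draw n_per_class points per class from alternating unit boxes.

    shape is the number of boxes along each feature axis, e.g. (2, 3).
    Returns [normal, abnormal], each of shape (n_per_class, len(shape)).
    """
    cells = np.array(list(np.ndindex(*shape)))   # lower corner of every box
    parity = cells.sum(axis=1) % 2               # alternate classes
    samples = []
    for c in (0, 1):
        own = cells[parity == c]
        picks = own[rng.integers(len(own), size=n_per_class)]
        # Uniform position inside the chosen box.
        samples.append(picks + rng.random((n_per_class, len(shape))))
    return samples

rng = np.random.default_rng(1)
normal, abnormal = sample_checkerboard((2, 3), 500, rng)
print(normal.shape, abnormal.shape)
```

Passing `(2, 2, 2)` as the shape gives the 3D variant with four boxes per class.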

C.  Classifiers 

We studied three types of classifiers: the linear discriminants, the quadratic discriminants, and the back-propagation neural networks. They represent a range of classifiers commonly used in the field of pattern recognition at present.


Fig. 4. A schematic illustration of the two class distributions with unequal covariance matrices and equal means in a 2D feature space. The circles represent contours of equal probability in each distribution.

Fig. 5. An example of a 2×3 checkerboard in a 2D feature space.

Fig. 6. A schematic diagram of a backpropagation neural network with one hidden layer.

(1) Linear discriminant classifier: The linear discriminant classifier can be derived from the means and the covariance matrices of the class distributions as follows:¹,¹³

$$h_l(X) = (\mu_2-\mu_1)^T\Sigma^{-1}X + \frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2\right), \tag{4}$$

where Σ = (Σ1 + Σ2)/2, and X is the feature vector to be classified. The means and covariance matrices have to be estimated as the sample means and sample covariance matrices from the available design samples. The sample means and covariance matrices undergo a nonlinear transformation to become the discriminant scores, which in turn are transformed nonlinearly into a measure of the performance. The variances in the estimated parameters propagate into the mean classifier performance and result in a bias through the second derivative of the transformation function.

It is known that, for multivariate normal distributions with equal covariance matrices, the linear discriminant classifier is optimal and the classifier performance in the limit of large design samples is determined by the Mahalanobis distance, given by

$$A_z = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\sqrt{\Delta/2}} e^{-u^2/2}\,du. \tag{5}$$

For the class distributions with Δ = 3 to be used in this study, it can be derived from Eq. (5) that the maximum Az that the optimal linear discriminant can achieve in the limit of large design samples is 0.89.
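The value of 0.89 can be checked numerically. For two equal-covariance normal classes, the large-sample area under the ROC curve of the optimal linear discriminant is the standard normal cumulative distribution evaluated at sqrt(Δ/2), where Δ is the squared Mahalanobis distance. A short Python check using only the standard library follows; the function name `az_from_delta` is hypothetical.

```python
import math

def az_from_delta(delta):
    """Standard normal CDF evaluated at sqrt(delta / 2)."""
    z = math.sqrt(delta / 2.0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

az = az_from_delta(3.0)
print(round(az, 2))
```

A squared distance of zero gives 0.5, i.e., chance performance, as expected when the class means coincide.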

(2) Quadratic discriminant classifier: The quadratic discriminant classifier can be expressed as¹

$$h_q(X) = \frac{1}{2}(X-\mu_1)^T\Sigma_1^{-1}(X-\mu_1) - \frac{1}{2}(X-\mu_2)^T\Sigma_2^{-1}(X-\mu_2) + \frac{1}{2}\ln\frac{\det\Sigma_1}{\det\Sigma_2}. \tag{6}$$

When the class distributions are multivariate normal with unequal covariance matrices, the quadratic discriminant classifier is optimal in the limit of large training samples. The Bhattacharyya distance gives an upper bound on the Bayes error.¹ The general properties of the linear and quadratic classifiers have been described in the literature (for example, Fukunaga¹).
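The two discriminant functions can be written compactly in code. The sketch below assumes NumPy and takes the class means and covariance matrices as given; in practice they would be replaced by sample estimates from the design set. The linear score uses the average covariance matrix, and the quadratic score is the difference of the two class-wise quadratic forms plus half the log ratio of the determinants. The function names are hypothetical.

```python
import numpy as np

def linear_discriminant(x, mu1, cov1, mu2, cov2):
    """Linear score for each row of x, using the averaged covariance."""
    cov = 0.5 * (cov1 + cov2)
    w = np.linalg.solve(cov, mu2 - mu1)
    offset = 0.5 * (mu1 @ np.linalg.solve(cov, mu1)
                    - mu2 @ np.linalg.solve(cov, mu2))
    return x @ w + offset

def quadratic_discriminant(x, mu1, cov1, mu2, cov2):
    """Quadratic score for each row of x; larger values favor class 2."""
    d1, d2 = x - mu1, x - mu2
    q1 = np.einsum("ij,ij->i", d1, np.linalg.solve(cov1, d1.T).T)
    q2 = np.einsum("ij,ij->i", d2, np.linalg.solve(cov2, d2.T).T)
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    return 0.5 * q1 - 0.5 * q2 + 0.5 * (ld1 - ld2)

# With equal covariance matrices the two scores coincide.
rng = np.random.default_rng(0)
k = 4
a = rng.standard_normal((k, k))
cov = a @ a.T + k * np.eye(k)                 # arbitrary positive-definite matrix
mu1, mu2 = np.zeros(k), np.full(k, 0.5)
x = rng.standard_normal((10, k))
h_lin = linear_discriminant(x, mu1, cov, mu2, cov)
h_quad = quadratic_discriminant(x, mu1, cov, mu2, cov)
print(np.allclose(h_lin, h_quad))
```

The agreement in the equal-covariance case reflects the fact that the quadratic terms in X cancel, leaving only the linear part.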

(3) Back-propagation neural network: Many different architectures and training methods have been developed for artificial neural networks (ANN)¹⁴ in various applications. In this study, we considered only a three-layer feed-forward neural network trained with the back-propagation method. The neural network has k input nodes, n hidden nodes, one output node, and a bias node in both the input and the hidden layers. The ANN architecture is denoted as k-n-1. The nodes in the ANN are fully connected and are trained with a minimum sum-of-squares-error criterion. The number of weights to be estimated is equal to n(k+1) + (n+1). A schematic diagram of an ANN is shown in Fig. 6.
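The weight count follows from the connections: each of the n hidden nodes receives k inputs plus a bias, and the single output node receives n hidden outputs plus a bias. A small Python illustration follows, with a hypothetical function name.

```python
def ann_weight_count(k, n):
    """Number of weights in a fully connected k-n-1 network with bias nodes."""
    hidden = n * (k + 1)     # k inputs + 1 bias into each hidden node
    output = n + 1           # n hidden outputs + 1 bias into the output node
    return hidden + output

for k, n in [(3, 2), (9, 2), (9, 9), (15, 15)]:
    print(f"{k}-{n}-1: {ann_weight_count(k, n)} weights")
```

For comparison, a 9-2-1 network has 23 weights whereas a 9-9-1 network has 100, which gives a sense of how quickly the number of parameters to be estimated grows with the number of hidden nodes.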

III.  RESULTS 

In our simulation study, we compared the performance of the linear, quadratic, and backpropagation neural network classifiers for the different class distributions in feature spaces of dimensionality ranging from 3 to 15. The number of repeated experiments Ne was chosen to be 20 for all cases in the multivariate normal feature spaces and 100 in the checkerboard feature space. The number of data set partitionings Np in each experiment ranged from 1 to 20. These choices are a compromise between computation time and estimation accuracy, especially for ANN classifiers with a large number of hidden nodes in high-dimensional feature spaces. As shown in the graphs discussed below, some of the performance curves may exhibit fluctuations that could be reduced by a larger number of experiments. However, the general trend of the performance curves should not be changed by the statistical uncertainties.

(1) Multivariate normal distributions, equal covariance matrices and unequal means: For class distributions with equal covariance matrices, the linear discriminant is theoretically the optimal classifier when the design sample size is large. However, when the design sample size is small, the performances of all classifiers are biased. Figures 7(a)-7(c) show the dependence of the Az obtained from resubstitution (training), Az(tr), and the Az obtained from the hold-out method (testing), Az(ts), on 1/N for the linear, ANN, and quadratic classifier, respectively. Two hidden nodes were used for the ANN (k-2-1) because it is the smallest number of hidden nodes in a nonlinear ANN. An ANN with only one hidden node will be a linear classifier and behave in a similar manner to the linear discriminant. On the other hand, ANNs with a large number of hidden nodes (not shown) will overfit the design samples and have poor generalizability to the unknown cases, similar to the ANN curves to be discussed below. All three classifiers can reach the optimal classification accuracy of Az = 0.89 in the limit of large N. The curves for the linear classifier and the ANN (k-2-1) at 400 training epochs (iterations) are approximately linear over the entire range. The quadratic classifier does not reach the approximately linear region until N is greater than about 100 (1/N < 0.01) in the higher-dimensional feature spaces. The biases on both the resubstitution and hold-out curves for the quadratic classifier are greater than those for the linear classifier and the ANN (k-2-1). The large biases again indicate overfitting and poor generalization by the quadratic classifier in the equal-covariance-matrices situation.

Fig. 7. The dependence of the Az obtained from resubstitution (training, solid lines), Az(tr), and the Az obtained from the hold-out method (testing, dashed lines), Az(ts), on 1/N for the class distributions with equal covariance matrices and unequal means. (a) Linear, (b) ANN, and (c) quadratic classifier. Legend: F3 = 3D feature space, etc.

Fig. 8. The performances of the classifiers for class distributions with unequal covariance matrices and unequal means. (a) Linear, (b) ANN classifier. Legend: F3 = 3D feature space, etc.; solid lines = Az(tr), dashed lines = Az(ts).


(2) Multivariate normal distributions, unequal covariance matrices and unequal means: The performances of the classifiers for class distributions with unequal covariance matrices are shown in Figs. 8(a)-8(b). The curves for the linear discriminant and the ANN (k-2-1) classifier (not shown) are again approximately linear over the entire range of N studied. However, the Az at 1/N = 0 decreases as the dimensionality of the feature space increases. This is because neither the linear discriminant nor the near-linear ANN (k-2-1) can make use of the class separability due to the differences in the covariance matrices, which is the second term in the Bhattacharyya distance. The second term increases relative to the first term, the squared Mahalanobis distance, when the Bhattacharyya distance is fixed and the dimensionality of the feature space increases.

The performance curves of the ANN at large N improve when a greater number of hidden nodes and a sufficient number of training epochs are used. The number of hidden nodes required to reach the optimal classification of Az = 0.89 at 1/N = 0 increases with the dimensionality of the feature space. Figure 8(b) shows the performance of the ANNs when the number of hidden nodes is equal to the dimensionality of each feature space. Since the number of weights to be trained increases rapidly with the number of nodes in an ANN, the number of epochs required for training the ANN to achieve a reasonable classification accuracy increases accordingly. The resubstitution and hold-out performance curves of each ANN shown in Fig. 8(b) were chosen at the smallest number of training epochs that resulted in approximately the highest Az value when the hold-out curve was extrapolated to 1/N = 0. The number of training epochs required to reach the highest Az increased as the dimensionality and the number of hidden nodes in the ANN increased. It ranged from about 4000 to 10 000 for the conditions shown in Fig. 8(b). We did not attempt to perform an exhaustive search for the "optimal" number of hidden nodes in each feature space because of the extensive computation time required for the search. Instead, we evaluated ANNs with a few different numbers of hidden nodes in each feature space and chose the "best" ANN among those studied. With this approximation we observed that, in a k-dimensional feature space and with these class distributions, an ANN with approximately k hidden nodes can approach the optimal performance when the design sample size and the number of training epochs are sufficiently large, as shown in Fig. 8(b).

Fig. 9. The dependence of the performance curves on the number of training epochs for an ANN with nine hidden nodes in a 9D feature space: ANN (9-9-1). Legend: it500 = 500 training epochs, etc.; solid lines = Az(tr), dashed lines = Az(ts). The expanded view in (b) shows the trend of the curves at large sample sizes. Horizontal axis: 1/No. of training samples per class.

To illustrate the training of an ANN with a large number of hidden nodes, we show the dependence of the resubstitution and the hold-out curves on the number of training epochs for ANN (9-9-1) in Fig. 9. A number of commonly discussed problems of an ANN can be observed. In the small-N region below about 60 samples per class, over-parametrization and over-training are obvious, i.e., near-perfect classification during training [Az(tr) greater than 0.95] and poor generalization [Az(ts) below about 0.8]. The problem becomes more pronounced with an increasing number of training epochs. In the middle range of 200 to 400 samples per class, where Az(ts) increases to a maximum and then decreases with further training, an "optimal" number of training epochs exists. Only in the region with a sufficiently large N (greater than about 500 per class) does Az(ts) increase with an increasing number of training epochs within the range studied. The Az(ts)-vs-1/N curve becomes linear for N greater than about 200. This dependence on the number of training epochs is generally observed for ANNs with a large number of hidden nodes and in high-dimensional feature spaces, although the design sample size required to avoid over-training and over-parametrization varies. It reinforces our general experience that ANNs with a large number of weights can overfit the design samples easily and provide poor generalization when the sample size is small.

Fig. 10. The dependence of the performance curves of an ANN on the number of hidden nodes in the 9D feature space for class distributions with unequal covariance matrices and unequal means. Legend: F921 = ANN with two hidden nodes, etc.; solid lines = Az(tr), dashed lines = Az(ts).

The performance curves of ANNs with different numbers of hidden nodes in the 9D feature space are shown in Fig. 10. The curves for a given ANN were again chosen at a training epoch at which the hold-out curve approached approximately the highest performance at 1/N = 0. The chosen training epoch ranged from 600 to 12 000 for the 2- to 15-hidden-node ANNs shown. When the number of hidden nodes is small, the highest Az obtained by extrapolation to 1/N = 0 appears to be below the theoretical optimum of 0.89. For example, the Az extrapolated to 1/N = 0 is about 0.85 for ANN (9-2-1) and about 0.87 for ANN (9-6-1). The ANN with nine hidden nodes appears to approach the optimal Az of 0.89 in the limit of 1/N = 0. However, the ANN (9-9-1) does not reach the approximately linear region until N is greater than about 200 (easier to see in Fig. 9). As can be seen from the hold-out curves, increasing the number of hidden nodes further will increase overfitting, reduce generalizability, and increase training time without gaining true improvement in performance for classification of unknown case samples.

The quadratic classifier is the theoretically optimal classifier for the class distributions with unequal covariance matrices. It can optimally utilize the class separability contributed by the differences in both the means and the covariance matrices. The performance curves for the quadratic classifier (not shown) in feature spaces of different dimensionalities are very similar to those obtained for the equal-covariance-matrices situation [Fig. 7(c)]. The Az of the quadratic classifier reaches the optimal value of 0.89 in the limit of large N for all dimensionalities studied.

Fig. 11. Comparison of the performance curves of the linear, quadratic, ANN (9-2-1), and ANN (9-9-1) classifiers in the 9D feature space for class distributions with unequal covariance matrices and unequal means. Legend: L = linear, Q = quadratic, ANN = neural network; solid lines = Az(tr), dashed lines = Az(ts).

Fig. 12. The dependence of the performance curves on dimensionality of feature space for the class distributions with unequal covariance matrices and equal means. (a) Linear, (b) ANN classifier. Legend: F3 = 3D feature space, etc.; F921 = ANN with two hidden nodes, etc.; solid lines = Az(tr), dashed lines = Az(ts). Horizontal axis: 1/No. of training samples per class.

Figure 11 shows a comparison of the performance of the linear, quadratic, and the ANN
classifiers with two and nine hidden nodes. The biases on the resubstitution and the hold-out
curves of the quadratic classifier are not as large as those of the ANN (9-9-1) classifier.
However, in the regime of small design sample sizes, the hold-out curve of the optimal
quadratic classifier can be much lower than the corresponding curves of the linear classifier
or ANN with one or two hidden nodes. This result indicates that the theoretically optimal
classifier may not be the optimal choice when the available design sample size is small and
overparametrization becomes an important consideration.

(3) Multivariate normal distributions with unequal covariance matrices and equal means:
Figure 12(a) shows the dependence of Az on 1/N for the linear classifiers for the class
distributions with equal means. Since the Mahalanobis distance is zero when the means of the
two class distributions are equal, the linear classifier performs no better than random
guessing in the hold-out situation (Az(ts) = 0.5). However, it is somewhat surprising that the
resubstitution curve can be biased to very high Az values when the design sample is small.
The bias increases with increasing dimensionality of the feature space because the severity of
overfitting to the design samples worsens with increased parameterization in the linear
discriminant function. This indicates that the predicted performance of a classifier can be
unrealistically optimistic if the test samples are not independent of the design samples.
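
The chance-level hold-out result for equal means can be checked with a short calculation. The sketch below is not part of the original study; it assumes diagonal covariance matrices and uses the standard result that, for two normal classes, the area under the ROC curve of a linear score w.x is the normal cumulative distribution evaluated at the mean difference of the scores divided by the square root of the sum of their variances. The function names and the example numbers are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear_score_auc(w, mean_a, mean_b, var_a, var_b):
    """Area under the ROC curve of the score w.x for two normal classes.

    Assumption: covariance matrices are diagonal, so var_a and var_b hold
    the per-feature variances of class A and class B.
    """
    # Difference between the mean scores of the two classes.
    shift = sum(wi * (mb - ma) for wi, ma, mb in zip(w, mean_a, mean_b))
    # Variance of the difference between two independent scores.
    spread = sum(wi * wi * (va + vb) for wi, va, vb in zip(w, var_a, var_b))
    return normal_cdf(shift / math.sqrt(spread))

# Hypothetical 3D example: equal means, unequal variances, arbitrary weights.
auc_equal_means = linear_score_auc(
    [0.7, -1.3, 2.0], [0, 0, 0], [0, 0, 0], [1, 1, 1], [4, 4, 4])

# Same variances, but class B shifted by one unit along each axis.
auc_shifted = linear_score_auc(
    [1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 1, 1], [4, 4, 4])

print(auc_equal_means, auc_shifted)
```

With equal means the numerator vanishes for every weight vector, so no amount of training can lift the true performance of a linear score above 0.5; only the resubstitution estimate can appear higher.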

For the class distributions with equal means, it is much more difficult to train the ANN
classifier. The number of hidden nodes and the number of training epochs required for the ANN
to approximate the decision surfaces, which are spherical hypersurfaces in the k-dimensional
feature space, increase as k increases. Figure 12(b) shows the Az-vs-1/N curves for the ANNs
in which the number of hidden nodes is 2 times the dimensionality of the feature space. The
number of training epochs required to approach the highest performance for a given ANN
architecture ranges from about 1800 to 20000 in these cases. Again we did not attempt an
exhaustive search for the "optimal" number of hidden nodes in each case. These ANNs were
chosen because they appear to approach the maximum performance of Az = 0.89 in the limit of
large N and their number of hidden nodes is a simple multiple of the dimensionality. Compared
to the class distributions with unequal means, for a given dimensionality, the number of
hidden nodes and the number of training epochs required for achieving the near maximum
performance at large N are greater in this equal-mean situation. Figure 13(a) shows an
example of the dependence of the performance curves on the number of hidden nodes in the 9D
feature space. Figure 13(b) is an enlarged view of the curves in Fig. 13(a) in the range where
the sample size is greater than 200 per class. The hold-out performance of ANN(9-9-1) at
1/N=0 reaches about 0.85. When the number of hidden nodes is greater than nine, the
performances of the ANNs at 1/N=0 are similar and approach the optimal Az.

Fig. 13. (a) The dependence of the performance curves of an ANN on the number of hidden nodes
in the 9D feature space for class distributions with unequal covariance matrices and equal
means. In the expanded scale (b), the approximately linear regions of the curves can be
observed. Solid lines=Az(tr), dashed lines=Az(ts).

The quadratic discriminant is again the theoretically optimal classifier for the class
distributions with unequal covariance matrices. Its performance curves (not shown) are very
similar to those plotted in Fig. 7(c), except that the extrapolated Az values at 1/N=0 do not
reach as high as those in the equal covariance matrices situation. By using the approximately
linear region of the Az-vs-1/N curve at N greater than 100, the extrapolated Az ranges from
about 0.873 to 0.885 for the 3D to 15D feature spaces. In this case, it is much more efficient
to train a quadratic discriminant than the ANN. Since the linear discriminant and ANNs with
few hidden nodes cannot provide effective classification regardless of the design sample
size, the quadratic discriminant is obviously the optimal classifier both in terms of
performance and training efficiency.

(4) Checkerboard distributions: In a feature space with checkerboard class distributions,
classification is difficult for many classifiers because of the disjoint clusters of samples
belonging to the same class. We compared the three classifiers in such a situation by two
examples. Figure 14 shows the performance curves of the three classifiers in a 2D feature
space with a 2×3 unit checkerboard distribution. Both the linear and the quadratic
discriminants perform poorly even for the resubstitution method where Az values are in the
range of 0.6 to 0.7. However, the ANN(2-3-1) can achieve an Az of 0.96 (not shown) and the
ANN(2-5-1) a near-perfect classification at a training epoch of about 1200.

Fig. 14. Performance curves of the three classifiers for a 2×3 unit checkerboard in a 2D
feature space. L=linear, Q=quadratic, ANN251=backpropagation neural network with five hidden
nodes. Solid lines=Az(tr), dashed lines=Az(ts).

In a 3D feature space with a 2×2×2 unit checkerboard distribution, the difficulty in
classification experienced by the linear and quadratic discriminants is even more apparent.
Figure 15 shows that the hold-out curve of the linear classifier is basically the same as
random guessing. The hold-out curve of the quadratic classifier is slightly higher than 0.5 at
small design sample sizes but approaches 0.5 as the design sample increases. On the other
hand, the ANN(3-3-1) can attain a test Az of 0.9 (not shown) and the ANN(3-5-1) can reach
near-perfect classification at large design sample sizes after about 1500 training epochs.
These two examples demonstrate that an ANN classifier can be superior to the linear or
quadratic classifiers for class distributions that are very different from the idealized
multivariate normal distributions.
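
As an illustration of the kind of class distribution discussed here, the following sketch draws samples from a unit checkerboard. The construction is an assumption, since the text does not spell out how the samples were generated: points are taken as uniform over the board, and the class of a point is the parity of the sum of its integer cell indices. The function names are hypothetical.

```python
import random

def checkerboard_label(point):
    """Class (0 or 1) of a point: parity of the sum of its integer cell indices."""
    return sum(int(coordinate // 1) for coordinate in point) % 2

def sample_checkerboard(cells_per_axis, n_samples, rng):
    """Draw points uniformly over a board with the given number of unit cells per axis."""
    samples = []
    for _ in range(n_samples):
        point = [rng.uniform(0, cells) for cells in cells_per_axis]
        samples.append((point, checkerboard_label(point)))
    return samples

rng = random.Random(0)  # fixed seed so the sketch is repeatable
board_2d = sample_checkerboard([2, 3], 1000, rng)     # 2x3 board in 2D
board_3d = sample_checkerboard([2, 2, 2], 1000, rng)  # 2x2x2 board in 3D
```

Under this assumed construction the two classes of the 2×2×2 board have identical means and identical covariance matrices, which is consistent with the chance-level hold-out performance of the linear and quadratic discriminants, whereas the two classes of the 2×3 board differ slightly in their means.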

IV.  DISCUSSION 

Classifier design is an important field of research in computer-aided diagnosis. Yet many of
the issues related to classifier design have not been explored systematically. This
simulation study is a part of our on-going investigation of the sample size effects on
classifier design.7-11,15 In this study, we evaluated classifier performance for three
multivariate normal class distributions with specific properties: equal covariance matrices,
unequal covariance matrices, and equal means. These distributions are idealized but they do
approximate a range of situations that may occur in real classification problems. Since the
optimal classifier and the upper bound of classification accuracy in the limit of 1/N=0 are
known for each of these cases, we can compare the performances of the classifiers under each
condition with the optimum. In addition, a checkerboard class distribution was included in the
study. A comparison of the performances of the different classifiers for this class
distribution can illustrate their effectiveness when the distributions are very different from
multivariate normal.

Fig. 15. Performance curves of the three classifiers for a 2×2×2 unit checkerboard
distribution in a 3D feature space. Legend: L=linear, Q=quadratic, ANN351=backpropagation
neural network with five hidden nodes.

For all three classifiers, the Az(tr) obtained by resubstitution is biased optimistically
while the Az(ts) obtained by testing with an independent test set is biased pessimistically,
relative to the Az in the limit of infinite N, except for the situations when Az(tr) is
bounded from above by perfect classification (Az = 1) or when Az(ts) is bounded from below by
random guessing (Az = 0.5). The magnitude of the biases increases as the design sample size
decreases and as the dimensionality of the feature space increases. In the cases where a given
classifier has no discriminatory power for a given class distribution, for example, the linear
discriminant for the equal-mean or checkerboard class distributions, or the quadratic
discriminant for the 3D checkerboard class distribution, the test Az(ts) remains almost
constant at 0.5, independent of the design sample size. In many cases, the Az-vs-1/N curve
cannot be approximated by a straight line that extrapolates to the Az at 1/N = 0 until the
design sample sizes are very large, beyond the range of sample sizes that are generally
available for CAD classifier design. To estimate the performance of a classifier at large N
under the constraint of a small design sample, one may use the Fukunaga and Hayes resampling
scheme3 to derive several points along the Az-vs-1/N curves in the small sample size region.
If the extrapolated resubstitution and hold-out curves do not converge to approximately the
same Az at 1/N = 0, an average of the points on the two curves which correspond to the same
design sample size may be a closer estimate of Az than either Az(tr) or Az(ts). Because the
resubstitution and the hold-out curves are not biased symmetrically from the Az at infinite N,
the average thus obtained will only be a rough estimate. It is also not valid in cases when
the classifier has no discriminatory power with Az(ts) constant at about 0.5 or when the
resubstitution curve is overly optimistic with Az(tr) constant at about 1.

In any case, caution should be taken in estimating classifier performance by extrapolation to
1/N=0 or by averaging the resubstitution and hold-out performance as discussed above. The
estimated performance contains variances that have to be estimated using further tools. One
such attempt in estimating the components of variance by a bootstrapping resampling scheme has
been studied recently by Wagner et al.11 These estimates reveal the amount of bias and
variance in the classifier performance obtained with the finite design samples, thus allowing
estimation of the sample size required to achieve a desired degree of generalizability, rather
than replacing the need for a larger sample set and further studies.

With the equal-covariance-matrix class distributions, the linear discriminant is the optimal
classifier as expected. The biases are low and the computation is efficient. Moreover, since
the Az-vs-1/N relationship is linear over almost the entire range of design sample sizes, the
classifier performance at very large N can be estimated from the small sample size performance
by linear interpolation, as suggested by Fukunaga and Hayes3 and demonstrated previously by
Wagner et al.9
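
The straight-line estimate described above can be sketched as an ordinary least-squares fit of Az against 1/N, with the intercept taken as the estimate at 1/N = 0. This is only an illustration of the idea, not the authors' implementation; the function names are hypothetical and the data points are made up so that they lie exactly on a line.

```python
def fit_line(x_values, y_values):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def extrapolate_az(sample_sizes, az_values):
    """Estimate Az at 1/N = 0 from Az measured at several design sample sizes N."""
    inverse_n = [1.0 / n for n in sample_sizes]
    intercept, _ = fit_line(inverse_n, az_values)
    return intercept

# Made-up hold-out values lying exactly on Az = 0.89 - 2.0 / N (illustration only).
sizes = [20, 40, 80, 160]
az_holdout = [0.89 - 2.0 / n for n in sizes]
az_infinite = extrapolate_az(sizes, az_holdout)
print(round(az_infinite, 4))
```

The fit is only meaningful over the range of N in which the curve is approximately linear, which, as noted in the discussion, may lie beyond the available sample sizes for the more complex classifiers.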

With the unequal-covariance-matrices and equal-mean class distributions, the linear
discriminant and the backpropagation neural network with one hidden layer are inferior to the
quadratic classifier when the design sample size is large. The linear discriminant cannot
utilize the difference in the covariance matrices and underestimates the class separability
even when an infinite number of design samples is available. The ANN needs a relatively large
number of hidden nodes and a large number of training epochs in order to reach the optimal
performance. Its hold-out performance and the computation efficiency are both inferior to
those of the quadratic classifier. However, for the unequal-covariance-matrices and
unequal-mean case and a small design sample size, the linear classifier or an ANN with very
few hidden nodes, e.g., n = 2, provides better hold-out performance than the more complex ANNs
or the optimal quadratic classifiers. These results indicate that the bias on classifier
performance increases with increasing complexity (loosely related to the number of parameters
to be estimated) of the classifier. The linear classifier contains (k+1) independent
parameters and the quadratic classifier contains (k+1)(k+2)/2 independent parameters in their
formulations. The number of weights to be estimated for the ANN depends on the number of
hidden nodes as n(k+1)+(n+1). The number of weights in an ANN can therefore easily exceed that
of a quadratic classifier, although the estimation of the mean and covariance matrices for the
linear and quadratic discriminants may contribute additional "complexity" to the classifier
design. Two observations can be made. First, when the available sample size is small, a simple
classifier will have better generalization than a more complex classifier. Second, a complex
ANN or a quadratic classifier trained with an insufficient number of design samples
generalizes poorly, even if it is the optimal classifier for the class distributions. It is
therefore important to select an appropriate classifier by taking into consideration the
design sample size.
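
To make the parameter counts above concrete, the short sketch below evaluates the three expressions for the 9D feature space used in the simulations. The function names are hypothetical; the formulas are the ones quoted in the text, with k the number of features and n the number of hidden nodes.

```python
def linear_parameters(k):
    """Independent parameters of a linear discriminant with k features."""
    return k + 1

def quadratic_parameters(k):
    """Independent parameters of a quadratic discriminant with k features."""
    return (k + 1) * (k + 2) // 2

def ann_weights(k, n):
    """Weights of an ANN with k inputs, n hidden nodes, and one output node."""
    return n * (k + 1) + (n + 1)

k = 9
print("linear:", linear_parameters(k))        # 10
print("quadratic:", quadratic_parameters(k))  # 55
print("ANN(9-2-1):", ann_weights(k, 2))       # 23
print("ANN(9-9-1):", ann_weights(k, 9))       # 100
```

For nine features, the ANN with nine hidden nodes has nearly twice as many weights as the quadratic classifier has parameters, while the ANN with two hidden nodes sits between the linear and quadratic classifiers.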

A further problem in classifier design is that the true population distributions of the
classes in the feature space are generally unknown. It was suggested that the
quantile-quantile (Q-Q) plot and the chi-square plot may be used for investigating the
normality of univariate and multivariate sample distributions, respectively.16 However, it is
still unknown under what criteria the chi-square plot will indicate that it is optimal to use
a classifier designed under the normality assumption. For any measure of goodness-of-fit, when
the sample size is small, only the most aberrant deviations from the normal distribution can
be identified as a lack of fit from these plots.16 Therefore, there is often no a priori
knowledge to select an "optimal" classifier or to predict whether the observed performance is
caused by the sample size, the choice of an overly complex classifier, or by an actual poor
separation of the classes in the feature space. If one observes poor generalization of a
trained classifier in a truly independent test set, it will be important to take into
consideration all these factors and redesign the classifier.

In this study, we assumed that the best features have already been determined for the
classification task. In a general classifier design problem, the best set of features usually
has to be selected based on the available design samples. The feature selection step will
introduce additional biases to the classifier performance. The number of features selected
also has a strong influence on the classifier design, as can be seen from the dependence of
the bias on the dimensionality of the feature space. The investigation of this more complex
situation including both the feature selection and classifier training steps is underway.17

The term generalizability is nonspecific and needs to be qualified here. The present paper is
concerned with the generalizability of the mean performance of classifiers to unknown test
samples drawn from the same population of cases. We have shown in this paper that the mean
performance of a classifier depends on the number of samples used to train the classifier, the
architecture of the classifier, and, for multivariate-normal data, the means and covariances
of the population distributions. Suppose in this context that a classifier is trained on a
given finite number of design samples (patients). The mean performance of the classifier over
independent replications with the same number of design samples is generalizable to studies
characterized by the same number of design samples. In other words, the mean resubstitution or
hold-out performance is an unbiased estimate for repeated sampling of independent design and
test sample sets, respectively, when the same number of design samples is used. The classifier
performance may not, however, be generalizable to studies characterized by a different number
of design samples. In particular, when a very large and representative design sample size is
used, the mean performance may be very different from the mean performance that characterizes
the finite-training-sample condition. When the mean performance under the conditions of a
finite design sample size is close to that expected with a very large design sample size, the
finite-training-sample performance is said to be generalizable to the population performance.

The term generalizability is not only used with respect to mean performance; it is also used
with respect to uncertainty in performance, as reflected in estimates of error bars (standard
deviations, or the corresponding variances). For example, if we think of repeating a given
training and testing experiment on a classifier and if only the test samples are drawn
independently on the repeated trials, then the estimated uncertainties are said to be
generalizable only to a population of test samples. If, however, we think of repeating the
experiment and independently drawing new training samples as well as new test samples, then
the estimated uncertainties are said to be generalizable to a population of trainers and a
population of testers.17 Models for the components of variance in both paradigms are the
subjects of current work in progress.10,11 A key point of this latter work is the fact that
for computer-aided diagnosis, most available software for ROC analysis only provides estimates
of uncertainty that are generalizable to a population of test samples.

In this investigation, we have limited our study to only three types of classifiers: the
linear discriminant, the quadratic discriminant, and the backpropagation ANNs with one hidden
layer. There are, of course, many other variations of the ANN architecture and other
parametric or non-parametric classifiers available for feature classification tasks. The
purpose of our work is not to exhaustively evaluate all possible combinations of class
distributions and classifiers. Rather, by limiting our investigation to some well-known
situations, we can perform systematic analyses and gain some insights into the classifier
design problems. Furthermore, we have limited our discussion here to the estimates of the mean
classifier performance. Wagner et al.10,11 have investigated the variances of classifier
performance estimated from a finite sample set and developed models to study the relative
importance of the sizes of the training and test samples. It has been demonstrated that a
components-of-variance model can be estimated with a finite sample set by using a bootstrap
method. More importantly, the analysis of variances can reveal the generalizability of the
performance estimates to other training and test sample sets in the population. Our long term
goals are to find some guidelines for designing efficient resampling schemes that can minimize
the bias and variance of a trained classifier using the available samples, and to provide a
quantitative design tool that can estimate the design sample size requirement for a larger
"pivotal" study from the results of a smaller "pilot" study in order to achieve a desired
precision in Az and the desired generalizability.
Improvement of Radiologists' Characterization of Mammographic Masses by Using Computer-aided
Diagnosis: An ROC Study1


PURPOSE: To evaluate the effects of computer-aided diagnosis (CAD) on radiologists'
classification of malignant and benign masses seen on mammograms.

MATERIALS AND METHODS: The authors previously developed an automated computer program for
estimation of the relative malignancy rating of masses. In the present study, the authors
conducted observer performance experiments with receiver operating characteristic (ROC)
methodology to evaluate the effects of computer estimates on radiologists' confidence ratings.
Six radiologists assessed biopsy-proved masses with and without CAD. Two experiments, one with
a single view and the other with two views, were conducted. The classification accuracy was
quantified by using the area under the ROC curve, Az.

RESULTS:  For  the  reading  of  238  images,  the  Az  value  for  the  computer  classifier  was 
0.92.  The  radiologists'  Az  values  ranged  from  0.79  to  0.92  without  CAD  and 
improved  to  0.87-0.96  with  CAD.  For  the  reading  of  a  subset  of  76  paired  views,  the 
radiologists'  Az  values  ranged  from  0.88  to  0.95  without  CAD  and  improved  to 
0.93-0.97  with  CAD.  Improvements  in  the  reading  of  the  two  sets  of  images  were 
statistically  significant  (P  =  .022  and  .007,  respectively).  An  improved  positive 
predictive  value  as  a  function  of  the  false-negative  fraction  was  predicted  from  the 
improved  ROC  curves. 

CONCLUSION:  CAD  may  be  useful  for  assisting  radiologists  in  classification  of 
masses  and  thereby  potentially  help  reduce  unnecessary  biopsies. 


Breast  cancer  is  the  most  prevalent  non-skin  cancer  in  women;  178,700  new  cases  are 
estimated  to  have  occurred  in  1998  (1).  The  mortality  of  breast  cancer  is  the  second  highest 
among  all  cancer  deaths  in  women  (1).  At  present,  there  is  no  effective  method  to  prevent 
breast  cancer.  The  best  approach  to  reducing  the  breast  cancer  mortality  rate  is  early 
detection  and  treatment.  Because  the  mammographic  features  of  early-stage  breast  cancers 
are  not  very  specific,  the  need  for  high  detection  sensitivity  leads  to  biopsy  of  many 
low-suspicion  lesions.  The  positive  predictive  values  (PPVs)  of  mammographic  signs  are, 
therefore,  often  below  30%  (2,3). 

Computer-aided  diagnosis  (CAD)  is  considered  to  be  one  of  the  approaches  that  may 
improve  the  efficacy  of  mammography  (4).  With  CAD,  a  computerized  detection  algorithm 
alerts  a  radiologist  to  the  location  of  the  suspicious  lesions,  and/or  a  trained  computer 
classifier  provides  the  radiologist  with  an  estimate  of  the  likelihood  of  malignancy  of  a 
lesion.  The  radiologist  takes  into  consideration  the  information  provided  by  the  computer 
before  making  a  decision.  This  "second  opinion"  may  improve  the  diagnostic  accuracy 
because  it  serves  as  a  form  of  double  reading  (5).  Furthermore,  a  computer  evaluation  is 
often  more  consistent  and  reproducible  than  a  human  decision  maker  (6). 

Considerable research has been devoted to the development of computerized schemes for the
detection and classification of mammographic abnormalities. These efforts have advanced the
CAD technology such that clinical application appears to be possible in the near future. It
is, therefore, necessary to evaluate the effects of CAD on radiologists' detection and
diagnosis of mammographic lesions. In a previous receiver operating characteristic (ROC)
study, we demonstrated that CAD could improve radiologists' accuracy in the detection of
subtle microcalcifications on mammograms (7). Kegelmeyer et al (8) also reported an
improvement in radiologists' sensitivity for the detection of spiculated masses with use of a
computer aid. For the classification of mammographic lesions, it has been shown that a
computer classifier that estimated the likelihood of malignancy on the basis of mammographic
features extracted by radiologists could improve radiologists' accuracy in distinguishing
malignant from benign lesions (9-11).

Figure 1. Histograms illustrate the distributions of (a) size (ie, length of the long axis)
and (b) visibility ranking (1 = obvious, 5 = subtle) of the 253 masses included in the data
set. Because classification accuracy depends on the case mix, these distributions provided
some information on the masses in the data set.

We previously conducted ROC studies to compare the performance of radiologists with that of
the computer (12) and to compare radiologists' ability to classify masses with and without CAD
(13). Jiang et al (14) also performed an ROC study of the effect of CAD on radiologists'
performance in classifying microcalcifications. The results of all of these observer
performance studies indicate the potential to improve mammographic interpretation with a
computer aid.

We have developed an automated method to analyze masses seen on mammograms (15-17). A mass is
segmented from its surrounding breast tissue, and an image transformation technique is used to
transform the mass margin from the polar coordinate system to the Cartesian coordinate system.
A linear discriminant classifier then extracts the useful texture features from the
transformed image and merges them into a relative malignancy rating. Our approach is different
from others that use a trained classifier to merge radiologist-extracted image features or
feature codes by using the American College of Radiology Breast Imaging Reporting and Database
System lexicon (9-11). Our fully automated method has the advantage that, unlike a human
reader, it does not have variability in feature recognition and coding. In addition, the
computer may be able to extract some information, such as texture features, that may not be
readily perceived by human eyes. We conducted an ROC study to evaluate whether this computer
aid can improve radiologists' performance in the classification of mammographic masses (13).
The results of our observer performance study are described in this article.

Other investigators also have reported on automated algorithms for the classification of
mammographic masses (18-21). The methods used in these algorithms varied, and their accuracy
in classification cannot be compared directly because of the differences in the data sets.
However, the effects of CAD on radiologists' performance are not expected to depend strongly
on the specific algorithm if different computer aids of comparable accuracy are used.
Therefore, the applications of the findings of this study should not be limited to our
computerized classification aid.

MATERIALS  AND  METHODS 

Data  Set 

The data set for this study consisted of 253 mammograms obtained in 103 patients. Each image
contained a biopsy-proved mass that was evaluated in this study. Some cases involved multiple
views or images from multiple examinations. The cases were randomly selected from patient
files from the breast imaging division of a National Cancer Institute-designated national
cancer center with the approval of the Institutional Review Board. The PPV of masses
recommended for biopsy at this center is about 25%-30%, but an approximately equal number of
malignant and benign masses (127 and 126, respectively) were chosen to enhance the statistical
power in this observer performance study. Any images that were judged to be technically poor
were excluded.
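
Because the data set was balanced rather than drawn at the clinical malignancy rate, a PPV computed at a given operating point depends on which case mix is assumed. The sketch below applies Bayes' rule to show this dependence. It is an illustration only: the function name is hypothetical, and the operating point (true-positive fraction 0.9, false-positive fraction 0.5) is made up rather than taken from the study.

```python
def positive_predictive_value(tpf, fpf, prevalence):
    """PPV from true-positive fraction, false-positive fraction, and malignancy prevalence."""
    malignant_positive = tpf * prevalence        # malignant cases called positive
    benign_positive = fpf * (1.0 - prevalence)   # benign cases called positive
    return malignant_positive / (malignant_positive + benign_positive)

# Hypothetical operating point evaluated at two case mixes.
ppv_clinical_mix = positive_predictive_value(0.9, 0.5, 0.30)  # about 30% malignant
ppv_balanced_mix = positive_predictive_value(0.9, 0.5, 0.50)  # roughly the 127/126 split
print(round(ppv_clinical_mix, 3), round(ppv_balanced_mix, 3))
```

The same operating point therefore yields a noticeably higher PPV on the balanced set than it would at the clinical case mix, which is why PPV figures should be read together with the prevalence they assume.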

The mammograms were acquired with a contact technique. The dedicated mammographic systems had
a molybdenum anode and molybdenum filter, a 0.3-mm nominal focal spot, and a reciprocating
grid. MinR/MinR-E screen-film systems (Eastman-Kodak, Rochester, NY) were used with these
units. Sixty-two of the malignant masses and six of the benign masses were judged to be
spiculated by a radiologist (M.A.H.) experienced in mammography. The radiologist also measured
the size (ie, longest dimension) and ranked the visibility of the masses on a scale of 1
(obvious) to 5 (subtle) relative to the range of visibility of masses encountered in clinical
practice. For a description of the masses included in the data set, histograms of the size and
visibility of the masses are shown in Figures 1a and 1b, respectively.

For the computer analysis, the selected mammograms were digitized with a laser imager (Lumisys DIS-1000, Los Altos, Calif) at a pixel size of 0.1 x 0.1 mm and 12-bit gray levels. This imager has an optical density range of about 0.0-3.5. The optical density on the film was digitized linearly to pixel value at a calibration of 0.001 optical density unit/pixel value in the optical density range of about 0.0-2.8. The digitizer deviated from a linear response at an optical density higher than 2.8.

Figure 2. Example of rubber-band-straightening transform for extraction of texture features in the margin region surrounding a mass. (a) Original and (b) background-corrected images showing the region of interest with the mass, (c) mammogram showing an outline of the segmented mass, and (d) rubber-band-straightening-transformed image of a 40-pixel-wide region surrounding the segmented mass.

For the observer experiments, we used laser-printed images of the digitized mammograms for all readings. The images were printed with a 969HQ laser imager (Imation, Oakdale, Minn) that was connected to a Macintosh computer (Apple Computer, Cupertino, Calif) through a special digital interface. The interface provided a 12-bit in, 10-bit out look-up table and allowed images to be scaled to different factors with 15 interpolation methods. Because this laser imager has a pixel size of about 0.085 mm, we enlarged the images by about 18% (0.1 mm / 0.085 mm is approximately 1.18) during printing to maintain them at the same size as the original mammograms. One of the interpolation methods was chosen by an experienced radiologist (M.A.H.), who inspected the printed images with a magnifier and evaluated the sharpness of the spicules and mass boundaries. Because of the small pixel size used for both digitization and printing, essentially no noticeable blurring of the masses could be seen with the chosen interpolation method. The images were also inspected for the potential contouring effect of 10-bit output images, but no noticeable artifacts could be found. A linear pixel value-to-output optical density calibration curve of the laser imager was used for the printing. All images were printed with the same settings.

Computerized  Classification 
of  Masses 

Our computerized method of classifying mammographic masses has been described in detail previously (15-17). The method is summarized as follows: A region of interest that contained the biopsy-proved mass was identified on the mammogram by the radiologist. Background correction based on a distance-weighted estimation method was applied to the region of interest to reduce the low-frequency density variation in the region. A median-filtered smoothed image and two high-frequency enhanced images were generated from the background-corrected region of interest. The smoothed and enhanced gray-level values at each pixel were used as features in a k-means clustering algorithm to classify the pixels into two clusters; one was the mass, and the other was the surrounding breast tissue background. By choosing an appropriate criterion, a mass region slightly smaller than the actual mass that was visible on the image was segmented.
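
The pixel-clustering step can be illustrated with a minimal sketch. It assumes each pixel is described by three values (one smoothed and two enhanced gray levels) and uses a plain two-cluster k-means. The function name `two_means`, the seeding rule, and the toy feature values are hypothetical and are not taken from the method described in references 15-17; which cluster corresponds to the mass depends on the gray-level convention of the digitized image.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: List[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def two_means(features: List[Vector], max_iter: int = 100) -> List[int]:
    """Label each pixel feature vector 0 or 1 with k-means (k = 2).

    Seeding is an assumption of this sketch: the vectors with the smallest
    and the largest feature sums start cluster 0 and cluster 1.
    """
    order = sorted(range(len(features)), key=lambda i: sum(features[i]))
    centers = [tuple(features[order[0]]), tuple(features[order[-1]])]
    labels = [-1] * len(features)
    for _ in range(max_iter):
        new_labels = [
            0 if _dist2(f, centers[0]) <= _dist2(f, centers[1]) else 1
            for f in features
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:
                centers[k] = _mean(members)
    return labels


# Toy example: (smoothed, enhanced-1, enhanced-2) values for six pixels.
pixels = [(10, 1, 2), (12, 0, 1), (11, 2, 1), (90, 8, 9), (95, 9, 8), (92, 7, 9)]
labels = two_means(pixels)  # -> [0, 0, 0, 1, 1, 1]
```

In a real region of interest the label list would be reshaped back to the image grid before the boundary is smoothed.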

The boundary of the segmented region was smoothed by morphologic filtering. A new image transformation technique, referred to as the rubber-band-straightening transform, was used to transform a 40-pixel-wide region that surrounded the segmented mass boundary into a rectangular region. After transformation, the mass margin became approximately parallel, and any spicules that were radiating from the mass became approximately perpendicular, to the long dimension of the rectangular region. The rubber-band-straightening transform enabled the spicules to be aligned approximately in a uniform direction and thus facilitated the extraction of texture features from the margin of the mass. An example of a rubber-band-straightening-transformed image is shown in Figure 2.

Two types of texture features were found to be useful for classification. The first set of features included eight texture measures derived from the spatial gray-level dependence matrices of the rubber-band-straightening-transformed image. A spatial gray-level dependence matrix element p_{θ,d}(i, j) is the joint probability of the occurrence of gray levels i and j for pixel pairs that are separated by a distance d and at a direction θ (22). For analysis of the masses, the spatial gray-level dependence matrices were constructed for 10 pixel distances (d = 1, 2, 3, 4, 6, 8, 10, 12, 16, 20 pixels) and in four directions (0°, 45°, 90°, 135°) relative to the mass boundary. Therefore, a total of 320 spatial gray-level dependence texture features were extracted.

Figure 3. Histogram of the test discriminant scores of the 253 masses obtained from the linear discriminant classifier by using a "leave one case out" training and test resampling scheme. For this classifier, a smaller discriminant score corresponded to a higher likelihood of malignancy. The discriminant scores were used as the decision variable in the ROC analysis of classification performance.

Figure 4. Binormal distribution fitted to the histogram of the discriminant scores of the malignant and benign masses. The discriminant scores were linearly transformed into a relative malignancy rating ranging from 1 to 10, where 1 corresponded to the most benign rating and 10 corresponded to the most malignant rating. This binormal distribution was shown to the observers during the training session to explain the rating scale of the computer classifier.
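
As an illustration of the matrix definition above, the following sketch builds a joint-probability matrix for one pixel offset and derives two common texture measures from it. The helper names, the non-symmetric pair counting, and the choice of contrast and energy as example measures are assumptions of this sketch; the eight measures actually used are those described in the cited references. The feature count of 320 follows from 8 measures x 10 distances x 4 directions.

```python
from typing import Dict, List, Tuple

# Distances and directions stated in the text.
DISTANCES = (1, 2, 3, 4, 6, 8, 10, 12, 16, 20)
ANGLES_DEG = (0, 45, 90, 135)
N_MEASURES = 8  # number of texture measures per matrix, as stated in the text

Matrix = Dict[Tuple[int, int], float]


def sgld_matrix(image: List[List[int]], dx: int, dy: int) -> Matrix:
    """Joint probability p(i, j) of gray levels i and j for pixel pairs
    separated by the offset (dx, dy). Pairs are counted in one direction
    only (an assumption of this sketch)."""
    rows, cols = len(image), len(image[0])
    counts: Dict[Tuple[int, int], int] = {}
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                pair = (image[r][c], image[r2][c2])
                counts[pair] = counts.get(pair, 0) + 1
                total += 1
    if total == 0:
        raise ValueError("offset is larger than the image")
    return {pair: n / total for pair, n in counts.items()}


def contrast(p: Matrix) -> float:
    """Sum of (i - j)^2 * p(i, j): large for abrupt gray-level changes."""
    return sum((i - j) ** 2 * v for (i, j), v in p.items())


def energy(p: Matrix) -> float:
    """Sum of p(i, j)^2: large for uniform texture."""
    return sum(v * v for v in p.values())


# Toy 3 x 3 image with two gray levels, offset d = 1 at 0 degrees.
toy = [[0, 0, 1],
       [0, 1, 1],
       [1, 1, 1]]
p = sgld_matrix(toy, dx=1, dy=0)
n_features = N_MEASURES * len(DISTANCES) * len(ANGLES_DEG)  # 320
```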

The  second  set  of  texture  features  was 
derived  from  the  run  length  statistics 
matrices  of  the  horizontal  and  vertical 
gradient  images  of  the  rubber-band- 
straightening-transformed  margin  region. 
Five  texture  measures  were  extracted  from 
the  run  length  statistics  matrix  in  each  of 
the  two  directions  (0°  or  90°)  on  each 
gradient image. A total of 20 run length statistics texture features were thus obtained. Therefore, we had a total of 340 features from the two types of texture measures.
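
A run length statistics matrix counts, for each gray level, how many runs of each length occur along a given direction. The sketch below builds such a matrix for horizontal runs and computes two measures in the style of Galloway's run length features. The function names and the choice of short-run and long-run emphasis as examples are assumptions, since the five measures are not named in this section, and in practice the gradient images would be quantized to a small number of gray levels first.

```python
from typing import Dict, List, Tuple

RunMatrix = Dict[Tuple[int, int], int]  # (gray level, run length) -> count


def run_length_matrix(image: List[List[int]]) -> RunMatrix:
    """Count horizontal (0 degree) runs of identical gray level."""
    runs: RunMatrix = {}
    for row in image:
        start = 0
        for c in range(1, len(row) + 1):
            # A run ends at the end of the row or where the gray level changes.
            if c == len(row) or row[c] != row[start]:
                key = (row[start], c - start)
                runs[key] = runs.get(key, 0) + 1
                start = c
    return runs


def transpose(image: List[List[int]]) -> List[List[int]]:
    """Swap rows and columns so vertical (90 degree) runs become horizontal."""
    return [list(col) for col in zip(*image)]


def short_run_emphasis(runs: RunMatrix) -> float:
    """Weights each run by 1 / length^2; large for fine texture."""
    total = sum(runs.values())
    return sum(n / length ** 2 for (_, length), n in runs.items()) / total


def long_run_emphasis(runs: RunMatrix) -> float:
    """Weights each run by length^2; large for coarse texture."""
    total = sum(runs.values())
    return sum(n * length ** 2 for (_, length), n in runs.items()) / total


toy = [[1, 1, 1, 2],
       [3, 3, 3, 3]]
horizontal = run_length_matrix(toy)           # {(1, 3): 1, (2, 1): 1, (3, 4): 1}
vertical = run_length_matrix(transpose(toy))  # runs along columns

# Feature count stated in the text: 5 measures x 2 directions x 2 gradient images.
n_features = 5 * 2 * 2  # 20
```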

A stepwise linear discriminant feature selection procedure (23) was used to select the most effective features from the available feature set. A total of 41 features were selected. The selected features were input into the Fisher linear discriminant classifier (24) as predictor variables. A "leave one case out" resampling scheme was used to train and test the classifier. A histogram illustrating the test discriminant scores of the 253 masses is shown in Figure 3. For this classifier, a smaller discriminant score corresponded to a higher likelihood of malignancy. By using the test discriminant score as the decision variable, the performance of the computer classifier could be evaluated by using ROC analysis (17,25,26) and compared with that of the radiologists, as described later.
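
The train-and-test scheme can be sketched as follows, assuming a two-class Fisher discriminant with weights w = Sw^-1 (m1 - m0) and a loop that withholds every image of one case at a time. All function names and the toy data are hypothetical, feature selection is omitted, and the sign convention here (larger score for class 1) is an assumption that is opposite to the convention reported for the classifier in this study.

```python
from typing import List, Sequence

Row = Sequence[float]


def _mean(rows: List[Row]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _scatter(rows: List[Row], mean: List[float]) -> List[List[float]]:
    """Within-class scatter matrix: sum of outer products of deviations."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for x in rows:
        diff = [xi - mi for xi, mi in zip(x, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fisher_weights(x0: List[Row], x1: List[Row]) -> List[float]:
    """Fisher discriminant direction for classes 0 and 1."""
    m0, m1 = _mean(x0), _mean(x1)
    s0, s1 = _scatter(x0, m0), _scatter(x1, m1)
    sw = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(s0, s1)]
    return _solve(sw, [b - a for a, b in zip(m0, m1)])


def leave_one_case_out_scores(features, labels, case_ids) -> List[float]:
    """Score each sample with a classifier trained without its own case."""
    scores = []
    for k, x in enumerate(features):
        train = [i for i in range(len(features)) if case_ids[i] != case_ids[k]]
        x0 = [features[i] for i in train if labels[i] == 0]
        x1 = [features[i] for i in train if labels[i] == 1]
        w = fisher_weights(x0, x1)
        scores.append(sum(wi * xi for wi, xi in zip(w, x)))
    return scores


# Toy data: two features, four samples per class, one image per case.
benign = [(0, 0), (2, 0), (0, 2), (2, 2)]
malignant = [(4, 4), (6, 4), (4, 6), (6, 6)]
scores = leave_one_case_out_scores(benign + malignant, [0] * 4 + [1] * 4, list(range(8)))
```

Grouping by case identifier rather than by image keeps multiple views of the same mass from appearing in both the training and the test partition.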

Relative  Malignancy  Rating 
of  the  Masses 

For the observer performance study, we provided a relative malignancy rating of each mass to the observer during the reading session with CAD. The relative malignancy rating was obtained by taking a linear transformation of the computer classifier's decision variable to a range of 1-10 and rounding the value to the nearest integer. The transformation also reversed the relative magnitude of the decision variables so that 1 corresponded to the highest benignity rating, and 10 corresponded to the highest malignancy rating.
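
A minimal sketch of such a transformation is given below. The text does not state which score values were mapped to the ends of the scale, so the end points `score_min` and `score_max` (taken here as the extreme discriminant scores) and the function name are assumptions.

```python
import math


def relative_malignancy_rating(score: float, score_min: float, score_max: float) -> int:
    """Map a discriminant score linearly to an integer rating from 1 to 10.

    The scale is reversed: the smallest score (most malignant for this
    classifier) maps to 10 and the largest score maps to 1.
    """
    if score_max <= score_min:
        raise ValueError("score_max must exceed score_min")
    fraction = (score_max - score) / (score_max - score_min)
    rating = math.floor(1.0 + 9.0 * fraction + 0.5)  # round to nearest integer
    return max(1, min(10, int(rating)))              # clamp out-of-range scores


# Hypothetical score range of -3 to 3.
ratings = [relative_malignancy_rating(s, -3.0, 3.0) for s in (-3.0, -1.0, 0.0, 1.0, 3.0)]
# -> [10, 7, 6, 4, 1]
```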

The purpose of the transformation was to provide a simple and intuitive relative scale for the observer. Because the transformation was linear and monotonic, the distributions of the normal and abnormal samples, as well as their ROC curves, were not affected, with the exception of a small error caused by making the decision variables discrete. Furthermore, the intercept a and slope b parameters that were fitted to the transformed discriminant scores for the normal and abnormal samples by using the labroc program (26) were used to generate a binormal distribution. The fitted binormal distribution with the relative malignancy rating on a 1-10 scale (Fig 4), together with the computer's ROC curve, was shown and explained to the observers during a training session.
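
For readers unfamiliar with the binormal model, the sketch below shows how an ROC curve and its area follow from the a and b parameters under the conventional parameterization, in which TPF = Phi(a + b * z(FPF)) and Az = Phi(a / sqrt(1 + b^2)), with Phi the standard normal cumulative distribution. The function names and the example parameter values are hypothetical; they are not the values fitted in this study.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def binormal_tpf(fpf: float, a: float, b: float) -> float:
    """True-positive fraction at a given false-positive fraction (0 < fpf < 1)."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def binormal_az(a: float, b: float) -> float:
    """Area under the binormal ROC curve."""
    return _STD_NORMAL.cdf(a / math.sqrt(1.0 + b * b))


# Hypothetical parameters: a = 2.0, b = 1.0.
az = binormal_az(2.0, 1.0)  # about 0.92
curve = [(f / 10, binormal_tpf(f / 10, 2.0, 1.0)) for f in range(1, 10)]
```

With a = 0 and b = 1 the two class distributions coincide and the area reduces to 0.5, the chance level.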


Observer  Performance  Study 

Two ROC experiments (27) were conducted: The masses were evaluated from a single view in the first experiment and from two views in the second experiment. The location of the biopsy-proved mass was marked on each image so that the correct mass was evaluated by all observers. The observers were instructed to ignore any other possible masses on the images. Six radiologists (M.A.H., M.A.R., T.E.W., D.D.A., C.P., J.S.N.) who are approved by the Mammography Quality Standards Act and have 7-20 years of experience in interpreting mammograms participated in the observer performance experiments.

There were two reading sessions in each experiment: one with CAD and the other without CAD. The observers were asked to rate the likelihood of malignancy of the masses on a 10-point confidence rating scale under all reading conditions. In the first session, half the observers interpreted the images without CAD, and the other half interpreted them with CAD. The two reading sessions in the same experiment were separated by at least 3 weeks, and the two experiments were separated by 6 months. For all four reading sessions, the observer had unlimited time to read each case. To estimate the average reading time per case for each observer, the reading time for each case was recorded by using a stopwatch.

In the first experiment, the data set of 253 single-view mammograms was divided into a training set of 15 mammograms and a study set of 238 mammograms (117 benign, 121 malignant). In each reading session, training was conducted before the reading of the study images. For the reading session with CAD, the fitted binormal distributions of the computer rating scores (Fig 4) for the entire data set were explained to the observer during training to familiarize the observer with the computer's rating scale. The computer rating of the mass was displayed on each image. After reading each training image, the observer was told the results of biopsy of the mass.

Figure 5. ROC curve for computerized classification of the 238 masses used in the observer performance study with single-view reading. The computer's ROC curve can be compared with the radiologists' ROC curves obtained from the single-view reading experiment illustrated in Figures 6 and 8.

Each observer read the entire data set in one reading session. The order of the study images was randomized by a random number generator. The random sequence was different for each observer and for each reading session by the same observer. For the reading session with CAD, the observer was free to look at the computer rating, which was displayed on the image, either before or after estimating the likelihood of malignancy of the mass. However, each observer was asked to always read the computer rating before making a final decision. The observer was not informed of the pathologic results of any mass on the study images.

The second experiment was very similar to the first experiment. From the 238 single-view mammograms, 76 matched pairs (37 benign, 39 malignant) of craniocaudal and mediolateral oblique or lateral views were found. Another six pairs of two-view mammograms were identified from the rest of the images and used as training cases. The remaining mammograms were either single-view images or additional views of the pairs already chosen, so they were not used in this experiment. In this experiment, the observers were not informed of the pathologic results of any study case in any reading session. The 76 pairs of mammograms were read in one reading session by each observer.

For  the  reading  session  with  CAD,  the 
rating  of  the  mass  in  each  view  was 
displayed  on  the  respective  image.  The 
computer  ratings  of  the  mass  on  the  two 
views  were  generally  different.  It  was  up 
to  the  observer  to  decide  how  to  merge 
the  two-view  information.  Observers  were 
asked  to  give  a  single  rating  of  the  mass 
after  reading  both  views. 

ROC  Analysis 

The confidence ratings of each observer obtained from each reading condition were analyzed by using ROC methodology, and the classification accuracy was quantified by using the area under the ROC curve, Az. A maximum likelihood estimation of the binormal distribution was fitted to the confidence ratings by using the labroc program. This program provides an estimate of the Az and of the a and b parameters of the ROC curve. The statistical significance of the difference in Az between the reading with CAD and that without CAD was estimated with two methods: One was the Student paired t test for observer-specific paired data; the other was the Dorfman-Berbaum-Metz method for analysis of multireader, multicase ROC data (28). The statistical significance of the difference in Az for reading single-view and two-view mammograms was estimated by using the Student paired t test for the six observers. The Student paired t test takes into account the statistical variation of readers, whereas the Dorfman-Berbaum-Metz method considers both reader variation and case sample variation by means of an analysis of variance approach. Therefore, the results of Dorfman-Berbaum-Metz analysis can be generalized to the population of readers as well as to the population of case samples.
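
The observer-specific paired comparison can be illustrated with a short calculation of the paired t statistic. The input values below are the rounded two-view Az values listed in Table 1; because they are rounded to two decimals, the resulting statistic is only indicative and is not expected to reproduce the reported P values exactly. The function name is hypothetical, and no P value is computed because that requires the t distribution, which is outside the standard library.

```python
import math
from typing import List, Tuple


def paired_t_statistic(with_aid: List[float], without_aid: List[float]) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [w - wo for w, wo in zip(with_aid, without_aid)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Rounded two-view Az values for the six radiologists (Table 1).
az_without_cad = [0.90, 0.95, 0.92, 0.88, 0.93, 0.89]
az_with_cad = [0.93, 0.97, 0.93, 0.95, 0.97, 0.93]

t_value, dof = paired_t_statistic(az_with_cad, az_without_cad)
# t_value is about 4.13 with 5 degrees of freedom
```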

Positive  Predictive  Value 

An ROC curve represents the entire range of operating conditions of a diagnostic process and is independent of disease prevalence. When the disease prevalence is known, any operating point on an ROC curve can be used to derive the PPV and the corresponding false-negative fraction (false-negative fraction = 1 - true-positive fraction) on the basis of the following relationship:

PPV = TPF × P(M) / [TPF × P(M) + FPF × P(B)],

where TPF is the true-positive fraction, FPF is the false-positive fraction at the chosen decision threshold, and P(M) and P(B) are the prevalences of malignant and benign cases, respectively. By varying the decision threshold, the dependence of the PPV on the false-negative fraction can be derived.
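
The relationship above translates directly into a small function. The sketch assumes that only malignant and benign cases are present, so that P(B) = 1 - P(M); the function name and the operating points in the example are hypothetical.

```python
def positive_predictive_value(tpf: float, fpf: float, p_malignant: float) -> float:
    """PPV = TPF * P(M) / [TPF * P(M) + FPF * P(B)], with P(B) = 1 - P(M)."""
    p_benign = 1.0 - p_malignant
    denominator = tpf * p_malignant + fpf * p_benign
    if denominator == 0.0:
        raise ValueError("no positive calls at this operating point")
    return tpf * p_malignant / denominator


# Calling every mass positive (TPF = FPF = 1) returns the prevalence itself.
ppv_all_positive = positive_predictive_value(1.0, 1.0, 0.25)  # 0.25

# Hypothetical operating point: no missed cancers, 32% of benign masses called positive.
ppv_example = positive_predictive_value(1.0, 0.32, 0.25)  # about 0.51
```

Sweeping the operating point along a fitted ROC curve and plotting the result against 1 - TPF gives the PPV as a function of the false-negative fraction.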

Because our data set did not include masses on which biopsy had not been performed, the ROC curves obtained in this study cannot be generalized to predict the performance of the computer classifier and radiologists in clinical practice. However, to demonstrate the possible effect of CAD on the PPV in the population of masses in which biopsy is likely to be performed under the current clinical criteria, we can estimate the PPV by using the prevalence of the malignant and benign masses in this patient group. Because the PPV of masses sent for biopsy ranges from about 25% to 44% in general and from about 25% to 30% at our institution, for the purposes of our estimation, we assumed that the P(M) was 25% and the P(B) was 75% in this population. A higher prevalence of malignant cases would cause an increase in the PPV, but the trend between the PPV curves with and without CAD would be similar.


RESULTS 


The ROC curve illustrating the performance of the computer classifier for the 238 study mammograms is shown in Figure 5. The ROC curve for the entire set of 253 mammograms (not shown) was almost identical to that of the 238 study cases; this indicates that the 15 training cases were typical of the 238 cases used in the study. The Az values (± SD) for both ROC curves were 0.92 ± 0.02.

For the first experiment of reading the 238 single-view mammograms, the ROC curves for the readings by the six radiologists both without and with CAD are shown in Figures 6a and 6b, respectively. The Az values of the six radiologists for the readings with and without CAD are listed in Table 1.

For the second experiment of reading the 76 pairs of two-view mammograms, the ROC curves for the readings by the six radiologists both without and with CAD are shown in Figures 7a and 7b, respectively. The Az values of the six radiologists in this experiment are also listed in Table 1.

Figure 6. ROC curves for the six observers for single-view reading of the masses (a) without CAD and (b) with CAD. (a, b) R1 = reader 1, R2 = reader 2, R3 = reader 3, R4 = reader 4, R5 = reader 5, R6 = reader 6. Five of the six observers achieved an increase in the area under the ROC curve, Az, with CAD.


TABLE 1

Areas under the ROC Curves for the Classification of Masses with and without CAD by the Six Radiologists

                                Az (Single View)*               Az (Two View)†
Radiologist No.                 Without CAD    With CAD         Without CAD    With CAD
1                               0.84 ± 0.03    0.87 ± 0.02      0.90 ± 0.03    0.93 ± 0.03
2                               0.92 ± 0.02    0.96 ± 0.01      0.95 ± 0.02    0.97 ± 0.02
3                               0.86 ± 0.02    0.91 ± 0.02      0.92 ± 0.03    0.93 ± 0.03
4                               0.79 ± 0.03    0.87 ± 0.02      0.88 ± 0.04    0.95 ± 0.03
5                               0.86 ± 0.02    0.92 ± 0.02      0.93 ± 0.03    0.97 ± 0.02
6                               0.89 ± 0.02    0.87 ± 0.02      0.89 ± 0.04    0.93 ± 0.03
From average a, b parameters    0.87           0.91             0.92           0.96

Note. Data are the mean ± SD.

* P = .022 for the difference between the Az values measured with CAD and those measured without CAD, as determined by using the Student two-tailed t test. P = .020 for this difference, as determined by using the Dorfman-Berbaum-Metz method.

† P = .007 for the difference between Az values measured with CAD and those measured without CAD, as determined by using the Student two-tailed t test. P = .026 for this difference, as determined by using the Dorfman-Berbaum-Metz method.


The  average  ROC  curve  was  derived 
from  the  average  a  and  b  parameters  of 
the  six  individual  ROC  curves  for  a  given 
reading  condition  (27).  The  average  ROC 
curves  for  the  four  reading  conditions  are 
shown  in  Figure  8.  The  Az  values  of  the 
average  ROC  curves  are  listed  in  Table  1. 

For the reading of the single-view mammograms, the performance of the computer classifier was comparable to that of the radiologist (reader 2) who had the highest classification accuracy (compare Figs 5 and 6) and higher than the average performance of the six radiologists (compare Figs 5 and 8). When the radiologists read the images with the computer aid, the classification accuracy of five radiologists improved (Table 1); the improvement in their Az values ranged from 0.04 to 0.08. The average performance of the six radiologists became comparable to that of the computer classifier. The improvement in the radiologists' classification accuracy by using CAD was statistically significant (P = .022, Student paired t test; P = .020, Dorfman-Berbaum-Metz method). Reader 2 with CAD obtained an Az value of 0.96, which was higher than that obtained by the radiologist alone or by the computer alone.

A trend similar to that with the single-view readings was observed with the two-view readings. The Az value of the computer classifier for the corresponding 152 single-view masses was 0.91 ± 0.02. The classification accuracy of all six radiologists improved when they read the mammograms with the computer aid. The increase in the Az values ranged from 0.01 to 0.07. The improvement was statistically significant (P = .007, Student paired t test; P = .026, Dorfman-Berbaum-Metz method). With CAD, two radiologists achieved an Az value of 0.97, which was higher than that obtained by the radiologists alone or by the computer alone. These results indicate that the second opinion provided by the computer classifier might have strengthened the radiologists' confidence in the interpretation of some difficult cases but had less influence on the radiologists' decision when the computer made mistakes or when the radiologists were confident about their decision.

As can be seen from the data in Table 1, the radiologists' accuracy in classifying masses by reading two-view mammograms was consistently higher than that by reading single-view mammograms (P = .008). This trend remained when they read the mammograms with CAD (P = .007). These findings are consistent with the clinical experience of the radiologists that at least two views of mammograms are needed to effectively evaluate a suspicious lesion.

Figure 7. ROC curves for the six observers for two-view reading of the masses (a) without CAD and (b) with CAD. (a, b) R1 = reader 1, R2 = reader 2, R3 = reader 3, R4 = reader 4, R5 = reader 5, R6 = reader 6. All six observers achieved an increase in the area under the ROC curve, Az, with CAD.

Figure 8. Average ROC curve obtained from the average a and b parameters of the six individual ROC curves for each of the four reading conditions. An improved ROC curve was achieved with CAD in both the single-view and two-view reading experiments.

The PPV as a function of the false-negative fraction was derived from the fitted ROC curves under the assumption that the prevalence of malignant masses was 25% in the population of masses sent for biopsy. The PPVs estimated for the six observers who read the two-view mammograms with and without CAD are plotted in Figure 9. CAD would provide an improvement in the PPV in the high false-negative fraction range for all observers except readers 2 and 5. The increase in the PPV at a decision threshold of "no missed malignant mass" (ie, false-negative fraction = 0) varied over a wide range; the largest gain, 39%, would be achieved by reader 2, and the smallest gain, 0%, would be achieved by reader 4.


DISCUSSION 


In the observer experiment of reading two-view mammograms with CAD, we presented the computer's rating of each view separately. The decision of how to merge the computer ratings of the two views was left to the radiologist. It is likely that the radiologists took the conservative approach of using the highest malignancy rating of the two as the computer's overall rating. However, it also might have depended on whether the relative ranking between the two computer ratings agreed with the observer's opinion. In some cases, we observed that the radiologist's rating was very different from the computer's rating of either view. Because decision making is a complex process, the simple approach of using the highest malignant rating or the average rating from multiple views may not be the method preferred by radiologists. The separate ratings that we used in this study would provide less biased information. Further investigation is needed to determine the best approach of presenting the computer's ratings to radiologists in clinical practice.

Figure 9. PPV as a function of the false-negative fraction derived from the ROC curves for the six observers (Fig 7). The PPV was predicted for a population of masses in which biopsy was likely to be performed under current clinical criteria and by assuming the prevalence of malignant masses to be 25%. R1 = reader 1, R2 = reader 2, R3 = reader 3, R4 = reader 4, R5 = reader 5, R6 = reader 6.

To obtain insight into how the radiologists might use the two-view information, we compared the classification results from their true two-view reading with those from a simulated two-view reading without the computer aid. The latter results were derived from ratings of single-view readings of the same 76 pairs of mammograms interpreted in experiment 2 by assuming two strategies: one in which the highest malignancy rating between the two ratings was used, and the other in which the average of the two ratings was used (Table 2). The Az values for these classification ratings derived from the single-view reading are listed in Table 2. The corresponding Az values for the computer classifier are also given in Table 2 for comparison.

The Az values for the maximal rating and the average rating were similar. Four of the radiologists obtained higher Az values at the true two-view reading; the Az values obtained by the remaining two radiologists were lower than those obtained at the simulated two-view reading. Although the difference did not achieve statistical significance (P = .37) and both readings included intraobserver variations, there seemed to be a slight trend toward the true two-view reading being more accurate than the simulated two-view reading. This may indicate that the radiologists used a more complex decision-making process to interpret the two views of the masses than that of simply maximizing or averaging the ratings from each view.
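
The two merging strategies used for the simulated two-view reading can be sketched as follows. The ratings are hypothetical, and the area is computed here with the empirical (Wilcoxon) estimate rather than the binormal labroc fit used in the study, so the numbers only illustrate the procedure.

```python
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # ratings of one mass on its two views


def merge_max(pair: Pair) -> float:
    """Conservative strategy: keep the higher malignancy rating."""
    return max(pair)


def merge_mean(pair: Pair) -> float:
    """Averaging strategy: mean of the two single-view ratings."""
    return (pair[0] + pair[1]) / 2.0


def empirical_auc(malignant: List[float], benign: List[float]) -> float:
    """Fraction of malignant-benign pairs ranked correctly (ties count 1/2)."""
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant) * len(benign))


def simulated_two_view_auc(malignant: List[Pair], benign: List[Pair],
                           merge: Callable[[Pair], float]) -> float:
    """Merge each pair of single-view ratings, then score the merged ratings."""
    return empirical_auc([merge(p) for p in malignant], [merge(p) for p in benign])


# Hypothetical single-view ratings on the 10-point scale.
malignant_pairs = [(8, 6), (9, 9), (4, 7)]
benign_pairs = [(2, 3), (5, 1), (6, 6)]

auc_max = simulated_two_view_auc(malignant_pairs, benign_pairs, merge_max)    # 1.0
auc_mean = simulated_two_view_auc(malignant_pairs, benign_pairs, merge_mean)  # 8/9
```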

In  this  study,  the  discriminant  scores  of 
the  masses  given  by  the  computer  classi¬ 
fier  were  transformed  into  a  relative  malig¬ 
nancy  rating.  The  relative  malignancy 
rating  scale  and  the  distribution  of  the 
malignant  and  benign  masses  along  the 
relative  rating  scale  were  explained  to  the 
observers  in  the  training  sessions.  A  rela¬ 
tive  malignancy  rating  scale  was  used 
because  the  true  likelihood  of  malig- 


TABLE  2 

Estimation  of  the  Malignancy 
Classification  of  76  Masses  by 
Two-View  Reading,  as  Simulated  from 
Single-View  Reading  of 
Mammograms  by  Radiologists 
without  CAD 


Az 


Radiologist 

No. 

Maximal 

Rating 

Average 

Rating 

1 

0.94  ±  0.03 

0.93  ±  0.03 

2 

0.94  ±  0.03 

0.94  ±  0.03 

3 

0.84  ±  0.05 

0.86  ±  0.04 

4 

0.85  ±  0.04 

0.83  ±  0.05 

5 

0.88  ±  0.04 

0.89  ±  0.04 

6 

0.91  ±  0.03 

0.92  ±  0.03 

Computer 

0.96  ±  0.02 

0.96  ±  0.02 

Note. — Data  are  the  mean  ±  SD.  Two  strate¬ 
gies  were  used:  In  one,  the  highest  of  the 
malignancy  ratings  on  each  view  was  used;  in 
the  other,  the  average  between  the  two  rat¬ 
ings  was  used. 


nancy of the masses could not be estimated from a small data set, as will be explained. However, the relative rating scale provided by the computer was adequate for measuring the relative performance of classification with and without CAD in an ROC study.

Figure 10. Histograms illustrate the confidence ratings of reader 5 obtained by reading 76 two-view mammograms (a) without CAD and (b) with CAD. The specificity of reader 5 at 100% sensitivity would increase from 5% (two of 37 masses) without CAD to 68% (25 of 37 masses) with CAD if an appropriate decision threshold were chosen.

If a computer classifier is trained and tested with very large data sets, and if both the malignant and benign cases represent random samples of the population, then the likelihood of malignancy of a classified mass can be estimated on the basis of the probability distributions of the classifier's test output scores and the prevalence of the two classes of masses in the patient population. However, with a relatively small data set, such as that used in this and other observer studies (14), there are limitations. First, the performance of a classifier trained with a small sample set may have large bias and variance (29-31). Second, the data set in this study did not include masses on which biopsy was not performed, so it did not represent a random sample of the masses in the patient population. If our classifier were applied to all cases of solid masses in clinical practice, the probability distribution of the test scores for the two classes of masses would be different from that of the current data set.

If we ignore the patient population at large, it is possible to estimate the likelihood of malignancy of a mass on the basis of the probability distribution of the classifier output scores by using the prevalence of the two classes of masses in this specific data set. However, the likelihood of malignancy derived in this way will be completely different from the true likelihood of malignancy of a mass in the patient population. This can be easily seen if one considers that the same mass with the same discriminant score will have a smaller likelihood of malignancy if it is analyzed within a data set that has a lower prevalence of malignant cases than that in the current data set.

Training the participating radiologists with a "likelihood of malignancy" derived from a small data set for the observer experiment may mislead them if they encounter a similar mass in their clinical practice. We, therefore, preferred to use a "relative malignancy rating," which is independent of the prevalences of malignant and benign masses in the data set. As long as the same classifier and the same linear transformation are used for classifying masses, the relative malignancy rating for a given mass will remain the same, regardless of the types of other masses in the data set. When a computer classifier is implemented in a clinical setting and its performance can be established in the patient population, the true likelihood of malignancy of a given mass can be estimated and provided to the radiologist. The true likelihood of malignancy may be a more informative measure for radiologists in the clinical application of CAD.

For the reading of the 76 two-view mammograms, the results of the ROC study indicated an improvement in the Az value for all six radiologists when the computer aid was used. This indicates an overall increase in the separation of confidence rating distributions between the malignant and benign cases. The histograms in Figure 10 illustrate the distributions of confidence ratings with and without CAD for reader 5, who achieved the second greatest improvement in both the Az value (Table 1) and the separation of malignant from benign distributions. Without CAD, this reader's ratings of the malignant cases ranged from 2 to 10. This is consistent with the fact that biopsy was performed in all masses in the data set to avoid missing the malignant cases. With CAD, there was marked improvement in the separation of the two distributions. It is possible to set a decision threshold at a confidence rating of 4, below which biopsy would not need to be performed and no malignant masses would be missed. The number of benign masses that could be identified without missing a malignant mass by setting an appropriate threshold would increase by 23 (out of 76 cases) for reader 5. Five of the six radiologists in our ROC study achieved an improvement in distinguishing benign from malignant masses, and one radiologist had no difference. Although the improvement of the five radiologists varied over a wide range, from one to 25 cases, this result indicates a strong possibility that CAD can be used to reduce the number of unnecessary biopsies.

The large variation in improvement among the radiologists may have been due to the different degrees of confidence that they had in the computer aid. As with any new diagnostic tool, this confidence is influenced by the experience the radiologist has with the tool. Although the radiologists received training before the reading sessions, the high variability in confidence was not unexpected, because this ROC study was the first instance in which they had worked with the computer aid. Their confidence levels may have also been reflected in the relatively low accuracy of classification by some radiologists with CAD compared with that of the computer classifier alone. If a radiologist can increase his or her confidence in the performance of a computer aid by gaining more extensive clinical experience, then he or she will likely be able to find the most effective way of merging his or her judgment with the computer's rating and thus reduce both interobserver and intraobserver variability. Because a radiologist who uses CAD can establish a meaningful decision threshold for biopsy only after becoming familiar with the sensitivity and specificity of working with CAD, the radiologists in this study were not asked to decide whether biopsy should have been performed on a mass. Rather, we focused on the evaluation of changes in the sensitivity and specificity of the radiologists' classification of masses when CAD was used.
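The idea of a decision threshold that misses no malignant mass can be made concrete with a short sketch. It assumes that cases rated at or above the threshold are sent to biopsy and that the threshold is set at the lowest rating given to any malignant case; the function name and the ratings below are hypothetical and are not the study data.

```python
def specificity_at_full_sensitivity(ratings, is_malignant):
    """Return (threshold, specificity) at 100% sensitivity.

    The threshold is the lowest rating given to a malignant case, so every
    malignant case is at or above it. Benign cases rated below the
    threshold are the ones that would be spared biopsy.
    """
    malignant = [r for r, m in zip(ratings, is_malignant) if m]
    benign = [r for r, m in zip(ratings, is_malignant) if not m]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign case")
    threshold = min(malignant)
    spared = sum(1 for r in benign if r < threshold)
    return threshold, spared / len(benign)


# Hypothetical confidence ratings on a 1-10 scale (not data from the study).
ratings = [2, 3, 3, 5, 1, 4, 6, 8, 9]
labels = [False, False, False, False, False, True, True, True, True]
threshold, specificity = specificity_at_full_sensitivity(ratings, labels)
# threshold == 4, specificity == 0.8 (four of five benign cases fall below 4)
```

Applied to the counts reported for reader 5 in Figure 10, the same ratio gives 2/37 (about 5%) without CAD and 25/37 (about 68%) with CAD.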

In this ROC study, all six observers were attending radiologists with extensive experience in the interpretation of mammograms. It is possible that the computer aid may be even more useful to radiology residents or radiologists with less experience in mammography. The effect of CAD on mammographic interpretation by less-experienced readers will be a subject of investigation in future studies.

The observers were allowed unlimited time to read each case in this ROC study. To obtain an estimate of the change in reading time with CAD, we recorded the reading time of each observer in each reading session by using a stopwatch. For the single-view reading experiment, the average reading time per image without CAD varied from 4.3 seconds to 17.1 seconds (mean time for the six observers, 7.8 seconds). The average reading time per image with CAD varied from 4.2 seconds to 17.3 seconds (mean time, 7.3 seconds). For the two-view reading experiment, the average reading time per pair of images without CAD varied from 6.6 seconds to 16.0 seconds (mean time, 10.4 seconds). The average reading time per pair of images with CAD varied from 7.6 seconds to 27.1 seconds (mean time, 13.5 seconds).

The reading time essentially did not change with use of the computer aid for the single-view readings. For the two-view readings, the radiologists took longer with CAD, probably because they had to merge the two computer ratings and merge the computer ratings with their own evaluations. Further investigation is needed to determine whether there is a trade-off between the radiologist's efficiency and the method of presenting the computer rating and whether the reading time with CAD will depend on the experience that the radiologist has with the computer information.
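The relative change in mean reading time can be checked directly from the mean times reported above. The helper name `percent_change` is introduced here only for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Mean reading times in seconds, as reported in the text.
single_view_change = percent_change(7.8, 7.3)   # about -6.4 (slightly faster)
two_view_change = percent_change(10.4, 13.5)    # about +29.8 (slower)
```

The small negative change for single-view reading is in line with the statement that reading time was essentially unchanged, and the two-view figure corresponds to an increase of roughly 30%.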

In the observer study, we used laser-printed mammograms instead of the original mammograms for the reading experiments. A major reason is that it is difficult to keep all the original mammograms together for the entire period of the study because they are part of active patient files and thus often recalled for comparison with new studies or for other clinical reasons. Because the maximum optical density of laser-printed images was 3.1 for the laser imager used, the contrast on the printed mammograms was about 20% lower than that on the original mammograms. Although the image quality was slightly lower than that of the original, the laser-printed digitized images were judged to be adequate for reading the details of the masses by the participating radiologists. The laser-printed image set might also be considered as one that had slightly more subtle masses than the original set of images. Because the relative performance of two modalities is measured in ROC experiments, and because the readings both with and without CAD in this study were conducted with the same set of printed images, the relative performance of the two readings should be valid. It should also be noted that in order for a computer aid that uses automated image analysis to be widely accepted, direct digital mammography would have to be the imaging modality in clinical use. Laser-printed images or soft-copy monitors will be the display medium for the digital mammograms. The use of laser-printed images for this ROC study was therefore practical.

In our observer performance experiment, we found that CAD improved the radiologists' ability to distinguish malignant from benign masses. This is consistent with the results of other studies (11,14) in which a statistically significant improvement (P < .001 in both studies) in the radiologists' classification accuracy by using CAD was found. The results of the former study (11) further showed that the PPV of a recommendation for biopsy by the radiologists was significantly increased (P < .001). In our approach, the computer classifier automatically extracted image features, whereas in the other studies, the computer classifier used the radiologist's evaluation and other patient information as input. Therefore, it appears that CAD can provide a useful second opinion to radiologists, either by consistently extracting and analyzing the image features or by optimally weighting various diagnostic factors and thereby improving the consistency in the decision-making process. This suggests that a computer classifier that combines both approaches (that is, one that automatically extracts image features and optimally merges them with the radiologist's evaluation and patient information) may be even more effective for breast cancer diagnosis. The latter step will also improve the radiologist's utilization of the computer rating on the basis of the computer-extracted features; this utilization was found to have large interobserver variation in our ROC experiment.

In conclusion, an ROC study of the effects of CAD on radiologists' classification of malignant and benign masses on mammograms was conducted. The results showed that CAD can provide a statistically significant improvement in the classification accuracy (that is, in the Az value) for both single-view reading (P = .022) and two-view reading (P = .007). The improved separation between the confidence ratings of the malignant masses and those of the benign masses indicates that CAD has the potential to reduce the rate of biopsy of benign masses when decision thresholds are properly chosen by the radiologists. The decision threshold may vary among radiologists, as in the case of mammographic interpretation without CAD, and can be set after the radiologist working with CAD has established his or her sensitivity and specificity with this approach through clinical experience.
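To illustrate what an area under the ROC curve measures, the sketch below computes the nonparametric (Mann-Whitney) area from two sets of confidence ratings. This is an assumption-laden illustration: it is a different estimator from the Az value reported in this study, which comes from the ROC analysis described earlier, and the function name and ratings are hypothetical.

```python
def empirical_auc(malignant_ratings, benign_ratings):
    """Nonparametric area under the empirical ROC curve.

    Equals the probability that a randomly chosen malignant case receives
    a higher rating than a randomly chosen benign case, with ties counted
    as one half.
    """
    if not malignant_ratings or not benign_ratings:
        raise ValueError("both groups must be non-empty")
    score = 0.0
    for m in malignant_ratings:
        for b in benign_ratings:
            if m > b:
                score += 1.0
            elif m == b:
                score += 0.5
    return score / (len(malignant_ratings) * len(benign_ratings))


# Hypothetical confidence ratings (not data from the study).
malignant = [4, 6, 8, 9]
benign = [1, 2, 3, 3, 5]
auc = empirical_auc(malignant, benign)  # 19 of 20 pairs correctly ordered -> 0.95
```

An area of 1.0 corresponds to complete separation of the two rating distributions, and 0.5 to no separation at all.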

Further studies are needed to evaluate the effects of CAD on the accuracy of radiologist classification of masses in clinical settings in which the prevalence of malignant masses is different from that in a laboratory data set and the likelihood of malignancy of a mass can be estimated by the computer classifier. In the two-view reading ROC experiment, the reading time per case increased by about 30% with the use of CAD. The dependence of the radiologist's efficiency in reading with CAD on the presentation method and on the reader's experience in using the computer information also warrants further investigation.
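The dependence of the likelihood of malignancy on prevalence, discussed earlier in this section, follows from Bayes' rule. The sketch below assumes that the probability densities of the classifier score for the malignant and benign classes are known at a given score; the function name and all numerical values are hypothetical and are not estimates from the study.

```python
def likelihood_of_malignancy(density_malignant, density_benign, prevalence):
    """Posterior probability of malignancy for one classifier score.

    density_malignant, density_benign: probability density of the observed
        score in the malignant and benign classes (assumed known).
    prevalence: fraction of malignant masses in the population considered.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    numerator = prevalence * density_malignant
    denominator = numerator + (1.0 - prevalence) * density_benign
    return numerator / denominator


# Hypothetical score that is twice as likely in malignant as in benign masses.
balanced_set = likelihood_of_malignancy(2.0, 1.0, prevalence=0.5)    # 2/3
low_prevalence = likelihood_of_malignancy(2.0, 1.0, prevalence=0.1)  # 2/11
```

The same score therefore maps to a much smaller likelihood of malignancy when malignant masses are rarer, which is why a rating that does not depend on prevalence was preferred for the observer experiment.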